<DLRMv2> Add return of construct_model in dlrm_main.py #174
Open: JunxiChhen wants to merge 89 commits into intel:main from JunxiChhen:patch-1
* revert bf16 changes (#488) * Add partials and spec yml for the end2end DLSA pipeline (#460) * Add partials and specs for the end2end DLSA pipeline * Add missing end line * Update name to include ipex * update specs to have use the public image as a base on one and SPR for the other * Dockerfile updates for the updated DLSA repo * Update pip install list * Rename to public * Removing partials that aren't used anymore * Fixes for 'kmp-blocktime' env var (#493) * Fixes for 'kmp-blocktime' env var Signed-off-by: Abolfazl Shahbazi <[email protected]> * update per review feedback Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'kmp-blocktime' for mlperf-gnmt (#494) * Add 'kmp-blocktime' for mlperf-gnmt Signed-off-by: Abolfazl Shahbazi <[email protected]> * Remove duplicate parameter definition Signed-off-by: Abolfazl Shahbazi <[email protected]> * add sample_input for resnet50 training (#495) * remove the case when fragment_size not equal args.batch_size (#500) * Changed the transformer_mlperf fp32 model so that we can fuse the ops… (#389) * Changed the transformer_mlperf fp32 model so that we can fuse the ops in the model, and also minor changes for python3 * Changed the transformer_mlperf int8 model so that we can fuse the ops in the model, and also minor changes for python3 * SPR updates for WW12, 2022 (#492) * SPR updates for WW12, 2022 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update for PyTorch SPR WW2022-12 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update pytorch base for SPR too Signed-off-by: Abolfazl Shahbazi <[email protected]> * Stick with specific 'keras-nightly' version Signed-off-by: Abolfazl Shahbazi <[email protected]> * Updates per code review Signed-off-by: Abolfazl Shahbazi <[email protected]> * update maskrcnn training_multinode.sh (#502) * Fixed a bug in the transformer_mlperf model threads setting (#482) * Fixed a bug in the transformer_mlperf model threads setting * Fix failing tests Signed-off-by: Abolfazl 
Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * Added the default threads setting for transformer_mlperf inference in… (#504) * Added the default threads setting for transformer_mlperf inference in case there is no command line input * Fix unit tests Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * PyTorch Image Classification TL notebook (#490) * Adds new TL notebook with documentation * Added newline * Added to main TL README * Small fixes * Updated for review feedback * Added more models and a download limit arg * Removed py3.9 requirement and changed default model * Adds Kitti torchvision dataset to TL notebook (#512) * Adds Kitti torchvision dataset to TL notebook * Fixed citations formatting * update maskrcnn model (#515) * minor update. (#465) * Create unit-test github action workflow (#518) * Create unit-test github action workflow Tested here: https://github.com/sriester/frameworks.ai.models.intel-models/runs/6089350443?check_suite_focus=true Runs tox py.test on push. * Containerize job * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Added login credentials to docker Trying to fix pull rate issue * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml Changed pip install command. 
* Update unit-test.yml * Update unit-test.yml * Update unit-test.yml Changed docker credentials to imzbot * Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519) Signed-off-by: Abolfazl Shahbazi <[email protected]> * update distilbert model to 4.18 transformers and enable int8 path (#521) * rnnt: use launcher to set output file path and name (#524) * Update BareMetalSetup.md (#526) Always use the latest torchvision * Reduce memory usage for dlrm acc test (#527) * updatedistilbert with text_classification (#529) * add patch for distilbert (#530) * Update the model-builder dockerfile to use ubuntu 20.04 (#532) * Add script for coco training dataset processing (#525) * and update tensorflow ssd-resnet34 training dataset instructions * update patch (#533) Co-authored-by: Wang, Chuanqi <[email protected]> * [RNN-T training] Enable FP32 gemm using oneDNN (#531) * Update the Readme guide for distilbert (#534) * Update the Readme guide for distilbert * Fix accuracy grep bug, and grep accuracy for distilbert Co-authored-by: Weizhuo Zhang <[email protected]> * Update end2end public dockerfile to look for IPEX in the conda directory (#535) * Notebook to script conversion example (#516) * Add notebook script conversion example * Fixed doc * Replaces custom preprocessor with built-in one * Changed tag to remove_for_custom_dataset * Add URL check prior to calling urlretrieve (#538) * Add URL check prior to calling urlretrieve Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a typo Signed-off-by: Abolfazl Shahbazi <[email protected]> * disable for ssd since fused cat cat kernel is slow (#537) * fix bug when adding steps in rnnt inference (#528) * Fix and updates for TensorFlow WW18-2022 SPR (#542) * Fix and updates for TensorFlow WW18-2022 SPR Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix TensorFlow SPR nightly versions Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update pre-trained models download URLs Signed-off-by: 
Abolfazl Shahbazi <[email protected]> * Intall Python 3.8 development tools Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix OpenMPI install and setup Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Horovod Installaion for SPR and CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Python3.8 version for CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a typo in TensorFlow 3d-unet partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a broken partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add TCMalloc to TF base container for SPR and remove OpenSSL Signed-off-by: Abolfazl Shahbazi <[email protected]> * Remove some repositories Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'matplotlib' for '3d-unet' Signed-off-by: Abolfazl Shahbazi <[email protected]> * switch to build OpenMPI due to issue in Market Place provided version Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix PYTORCH_WHEEL and IPEX_WHEEL arg values Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix and updates for PyTorch WW14-2022 SPR (#543) * Fix and updates for PyTorch WW14-2022 SPR Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix and updates for TensorFlow WW18-2022 SPR Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix TensorFlow SPR nightly versions Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update pre-trained models download URLs Signed-off-by: Abolfazl Shahbazi <[email protected]> * Intall Python 3.8 development tools Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix OpenMPI install and setup Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Horovod Installaion for SPR and 
CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Python3.8 version for CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a typo in TensorFlow 3d-unet partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a broken partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add TCMalloc to TF base container for SPR and remove OpenSSL Signed-off-by: Abolfazl Shahbazi <[email protected]> * Updates required to the base image Signed-off-by: Abolfazl Shahbazi <[email protected]> * Remove some repositories Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'matplotlib' for '3d-unet' Signed-off-by: Abolfazl Shahbazi <[email protected]> * switch to build OpenMPI due to issue in Market Place provided version Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix PYTORCH_WHEEL and IPEX_WHEEL arg values Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix PYT resnet50 quickstart scripts for both Linux and Windows (#547) * fix quickstart scripts, detect platform type, update to run with pytorch only * Fix SPR PyTorch MaskRCNN inference documentation for CHECKPOINT_DIR (#548) * Enable bert large multi stream inference (#554) * test bert multi stream module * enable input split and output concat for accuracy run * change the default num_streams batchsize cores to 56 * change ssd multi stream throughput to 1 core 1 batch * change the default parameter for rn50 ssd multi stream module * modify enable_ipex_for_squad.diff to align new multistream hint implementation * enable warmup and multi socket support * change default parameter for rn50 ssd multi stream inference * Add train-no-eval for rn50 pytorch (#555) * PyTorch SPR BERT large training updates (h5py and dataset instructions) and update LD_PRELOAD for SPR entrypoints (#550) * Add h5py install to bert training dockerfile * documentation updates * update docs, and add input_preprocessing to the wrapper package * Update LD_PRELOAD trailing : * Fix syntax * 
removing unnecessary change * Update DLRM entrypoint * Update docs to note that phase2 has bert_config.json in the CHECKPOINT_DIR * Fix syntax * increase shm-size to 10g * [RNN-T training] Update scripts -- run on 1S (#561) * Update maskrcnn training script to run on 1s (#562) * use single node to do ssd-rn34 training (#563) * Update training.sh (#564) * Update training.sh (#565) Use tcmalloc instead of jemalloc * use single node to do resnet50 training (#568) * add numactl -C and remove jit warm in main thread (#569) * Update unit-test.yml (#546) * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Fixed make command, updated pip install. Fixed make command to run from the root directory. Replaced pip install tox with a pip install -r requirements-tests.txt to install all dependencies for the tests. * Add tox to test dependencies. Added tox to the dependencies so that the Workflow and others may install it with pip install -r requirements-test.txt and be covered for running make lint and make unit-test. * Update unit-test.yml Changed 'make unit-test' to 'make unit_test' as that is the actual target defined in the Makefile. * Update unit-test.yml Changed apt-get install command. * re-enable int8 for api change (#579) * saperate fully convergency test from training test (#581) Co-authored-by: jianan-gu <[email protected]> * ssd enable new int8 (#580) * v1 * enable new int8 method * Revert "ssd enable new int8 (#580)" (#584) This reverts commit 9eb3211. * Revert "re-enable int8 for api change (#579)" (#583) This reverts commit 0bded92. 
* Update training script using 1s (#560) * Enable checkpoint during training for bert-large (#573) * minor fix * Add readme for enabling checkpoint * update phase1 to enable checkpoint by default * Update README.md * Enable ssd bf32 inference training (#589) * enable ssd bf32 inference * enable ssd bf32 train * enable RNN-T bf32 inference (#591) * Enable bf32 for bert and distilbert for inference (#593) * enable bf32 distilbert * enable bert bf32 * Enable RNN-T bf32 training (#594) * enable maskrcnn bf32 inference and training (#595) * enable resnet50 and resnext101 bf16 path (#596) * enable bert bf32 train (#600) * update resnet int8 path using new int8 api (#603) * re-enable int8 for api change (#604) Co-authored-by: jianan-gu <[email protected]> * Leslie/ssd enable new int8 (#605) * v1 * enable new int8 method * update json file * add rn50 int8 weight sharing Co-authored-by: Jiang, Xiaofei <[email protected]> * update ssd training bs to the multily of core numbers (#606) * enable bf32 for dlrm (#607) Co-authored-by: jianan-gu <[email protected]> * Update IPEX new int8 API enabling for distilbert/bert-large (#608) * enable distilbert * enable bert * fix max-ind-range and add memory info (#609) Co-authored-by: jianan-gu <[email protected]> * Remove debug code (#610) * update training steps (#611) * fix bandit scan fails (#612) * PYT Image recognition models support on Windows (#549) * fix all image recognition scripts to run on windows and linux with PYT, and only linux with IPEX * [RNN-T training] fix bandit scan fails (#614) * RNN-T inference: fix IMZ Bandit scan fails (#615) * Update unit-test.yml (#570) Changed the docker user credential to utilize GitHub Secret. 
* MaskRCNN: fix IMZ Bandit scan fails (#623) * Fix for horovod-related failures in TF nightly runs (#613) * cpp17 horovod failure fix * minor debugging changes * minor fixes - directory name * cleanup * addressing reviewer comments * Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34 (#624) * Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Set 'HOROVOD_WITH_MPI=1' explicitly Signed-off-by: Abolfazl Shahbazi <[email protected]> * update GCC version to GCC 9 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'horovodrun --check-build' for sanity check Signed-off-by: Abolfazl Shahbazi <[email protected]> * removo force install inside Docker Signed-off-by: Abolfazl Shahbazi <[email protected]> * [RNN-T training] Fix ddp sample number issue (#625) * update BF32 usage (#627) * resnet50 training: add warm up before collecting time (#628) * image to bf16 (#629) * Update end2end DLSA dockerfile due to SPR wheel path update and removing int8 patch (#631) * Update mlpc path for SPR wheels * remove patch * Update Horovod commit id for BareMetal, Docker will be updated next (#630) Signed-off-by: Abolfazl Shahbazi <[email protected]> * fix dlrm convergence and change training performance BS to 32K (#633) Co-authored-by: jianan-gu <[email protected]> * [RNN-T training] Merge sh files to one (#635) * update torch-ccl into 1.12 (#636) * Liangan1/update torch ccl version (#637) * Update torch_ccl version * resnet50_distributed_training: don't set MASTER_ADDR by user (#638) * Update torch_ccl in script (#639) * Enable offline download distilbert (#632) * enable offline download distilbert * add convert * Update README.md * add accuracy.py * add file * refine download * refine path * refine path * add license * Update dlrm_s_pytorch.py (#643) * Update README.md (#649) * init pytorch T5 language model (#648) * init pytorch T5 language model * update README.md * update doc * update fpn 
models (#650) * pytorch resnet50: directly call ipex.quantization (#653) * fix int8 accuracy (#655) Co-authored-by: Zhang, Weizhuo <[email protected]> * Made fixes to the broken links (#652) * Made fixes to the broken links * Changed the ResNet50v1_5 version back to v2_7_0 * Modified the setup AI kit instructions Co-authored-by: msalopan <[email protected]> * Update Security Center URL (#657) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Weizhuoz/fix for pt 1.12 (#656) * fix vgg11_bn accuracy syntax error * remove exact_match from roberta-base * modify maskrcnn BS to 2*num_cores * Update dlrm_s_pytorch.py (#660) * Update dlrm_s_pytorch.py Reduce int8 memory usage. * Update dlrm_s_pytorch.py * Update dlrm_s_pytorch.py * Update dlrm_s_pytorch.py * Update dlrm_s_pytorch.py * Add BF32 DDP for bert-large (#663) * Update run_ddp_bert_pretrain_phase1.sh * Update run_ddp_bert_pretrain_phase2.sh * Update README.md * move OMP_NUM_THREADS=1 into dlrm_s_pytorch.py (#664) minor changes * remove rn50 ao (#665) * Re-organize models list to be grouped by framework (#654) * re-organize models list to be grouped by framework * update tensorflow ssd-resnet34 training dataset * add T5 in benchmark/README.md * mannuel set torch num threads only for int8 (#666) * Update inference_performance.sh (#669) * improve ssdrn34 perf. (#671) * improve ssdrn34 perf. * minor update. * Fix linting Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix unit tests too Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * update py version in base spec (#678) * TF addons upgrade to 0.17.1 (#689) * updated tf adons version * remove comment * Sriniva2/ssd rn34 (#682) * improve ssdrn34 perf. * minor update. * enabling synthetic data. 
* Update base_benchmark_util.py * Fix linting error Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * Update Dockerfiles prior to IMZ 2.8 release (#693) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update Documents prior to IMZ 2.8 release (#694) Signed-off-by: Abolfazl Shahbazi <[email protected]> * add support for open SUSE leap operating system (#708) (#715) * updated tpps (#725) * remove tf bert int8 from main readmes, model is not supported in this release. (#743) * Adding Scipy for TensorFlow serving SSD-MobileNet model (#764) (#766) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * remove .github Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: leslie-fang-intel <[email protected]> Co-authored-by: Dina Suehiro Jones <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: XiaobingZhang <[email protected]> Co-authored-by: Xiaoming (Jason) Cui <[email protected]> Co-authored-by: jiayisunx <[email protected]> Co-authored-by: Melanie Buehler <[email protected]> Co-authored-by: Srini511 <[email protected]> Co-authored-by: Sean-Michael Riesterer <[email protected]> Co-authored-by: jianan-gu <[email protected]> Co-authored-by: Chunyuan WU <[email protected]> Co-authored-by: zhuhaozhe <[email protected]> Co-authored-by: Wang, Chuanqi <[email protected]> Co-authored-by: YanbingJiang <[email protected]> Co-authored-by: Weizhuo Zhang <[email protected]> Co-authored-by: xiaofeij <[email protected]> Co-authored-by: liangan1 <[email protected]> Co-authored-by: blzheng <[email protected]> Co-authored-by: Om Thakkar <[email protected]> Co-authored-by: mahathis <[email protected]> Co-authored-by: msalopan <[email protected]> Co-authored-by: Jitendra Patil <[email protected]>
* revert bf16 changes (#488) * Add partials and spec yml for the end2end DLSA pipeline (#460) * Add partials and specs for the end2end DLSA pipeline * Add missing end line * Update name to include ipex * update specs to have use the public image as a base on one and SPR for the other * Dockerfile updates for the updated DLSA repo * Update pip install list * Rename to public * Removing partials that aren't used anymore * Fixes for 'kmp-blocktime' env var (#493) * Fixes for 'kmp-blocktime' env var Signed-off-by: Abolfazl Shahbazi <[email protected]> * update per review feedback Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'kmp-blocktime' for mlperf-gnmt (#494) * Add 'kmp-blocktime' for mlperf-gnmt Signed-off-by: Abolfazl Shahbazi <[email protected]> * Remove duplicate parameter definition Signed-off-by: Abolfazl Shahbazi <[email protected]> * add sample_input for resnet50 training (#495) * remove the case when fragment_size not equal args.batch_size (#500) * Changed the transformer_mlperf fp32 model so that we can fuse the ops… (#389) * Changed the transformer_mlperf fp32 model so that we can fuse the ops in the model, and also minor changes for python3 * Changed the transformer_mlperf int8 model so that we can fuse the ops in the model, and also minor changes for python3 * SPR updates for WW12, 2022 (#492) * SPR updates for WW12, 2022 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update for PyTorch SPR WW2022-12 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update pytorch base for SPR too Signed-off-by: Abolfazl Shahbazi <[email protected]> * Stick with specific 'keras-nightly' version Signed-off-by: Abolfazl Shahbazi <[email protected]> * Updates per code review Signed-off-by: Abolfazl Shahbazi <[email protected]> * update maskrcnn training_multinode.sh (#502) * Fixed a bug in the transformer_mlperf model threads setting (#482) * Fixed a bug in the transformer_mlperf model threads setting * Fix failing tests Signed-off-by: Abolfazl 
Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * Added the default threads setting for transformer_mlperf inference in… (#504) * Added the default threads setting for transformer_mlperf inference in case there is no command line input * Fix unit tests Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * PyTorch Image Classification TL notebook (#490) * Adds new TL notebook with documentation * Added newline * Added to main TL README * Small fixes * Updated for review feedback * Added more models and a download limit arg * Removed py3.9 requirement and changed default model * Adds Kitti torchvision dataset to TL notebook (#512) * Adds Kitti torchvision dataset to TL notebook * Fixed citations formatting * update maskrcnn model (#515) * minor update. (#465) * Create unit-test github action workflow (#518) * Create unit-test github action workflow * Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519) Signed-off-by: Abolfazl Shahbazi <[email protected]> * update distilbert model to 4.18 transformers and enable int8 path (#521) * rnnt: use launcher to set output file path and name (#524) * Update BareMetalSetup.md (#526) Always use the latest torchvision * Reduce memory usage for dlrm acc test (#527) * updatedistilbert with text_classification (#529) * add patch for distilbert (#530) * Update the model-builder dockerfile to use ubuntu 20.04 (#532) * Add script for coco training dataset processing (#525) * and update tensorflow ssd-resnet34 training dataset instructions * update patch (#533) Co-authored-by: Wang, Chuanqi <[email protected]> * [RNN-T training] Enable FP32 gemm using oneDNN (#531) * Update the Readme guide for distilbert (#534) * Update the Readme guide for distilbert * Fix accuracy grep bug, and grep accuracy for distilbert Co-authored-by: Weizhuo Zhang <[email protected]> * Update end2end public dockerfile to look for IPEX in the conda 
directory (#535) * Notebook to script conversion example (#516) * Add notebook script conversion example * Fixed doc * Replaces custom preprocessor with built-in one * Changed tag to remove_for_custom_dataset * Add URL check prior to calling urlretrieve (#538) * Add URL check prior to calling urlretrieve Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a typo Signed-off-by: Abolfazl Shahbazi <[email protected]> * disable for ssd since fused cat cat kernel is slow (#537) * fix bug when adding steps in rnnt inference (#528) * Fix and updates for TensorFlow WW18-2022 SPR (#542) * Fix and updates for TensorFlow WW18-2022 SPR Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix TensorFlow SPR nightly versions Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update pre-trained models download URLs Signed-off-by: Abolfazl Shahbazi <[email protected]> * Intall Python 3.8 development tools Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix OpenMPI install and setup Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Horovod Installaion for SPR and CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Python3.8 version for CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a typo in TensorFlow 3d-unet partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a broken partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add TCMalloc to TF base container for SPR and remove OpenSSL Signed-off-by: Abolfazl Shahbazi <[email protected]> * Remove some repositories Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'matplotlib' for '3d-unet' Signed-off-by: Abolfazl Shahbazi <[email protected]> * switch to build OpenMPI due to issue in Market Place provided version Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix PYTORCH_WHEEL and IPEX_WHEEL arg 
values Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix and updates for PyTorch WW14-2022 SPR (#543) * Fix and updates for PyTorch WW14-2022 SPR Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix and updates for TensorFlow WW18-2022 SPR Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix TensorFlow SPR nightly versions Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update pre-trained models download URLs Signed-off-by: Abolfazl Shahbazi <[email protected]> * Intall Python 3.8 development tools Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix OpenMPI install and setup Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Horovod Installaion for SPR and CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Python3.8 version for CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a typo in TensorFlow 3d-unet partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a broken partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add TCMalloc to TF base container for SPR and remove OpenSSL Signed-off-by: Abolfazl Shahbazi <[email protected]> * Updates required to the base image Signed-off-by: Abolfazl Shahbazi <[email protected]> * Remove some repositories Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'matplotlib' for '3d-unet' Signed-off-by: Abolfazl Shahbazi <[email protected]> * switch to build OpenMPI due to issue in Market Place provided version Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix PYTORCH_WHEEL and IPEX_WHEEL arg values Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix PYT resnet50 quickstart scripts for both Linux and Windows (#547) * fix quickstart scripts, detect platform type, update to run with pytorch only * Fix SPR PyTorch MaskRCNN inference documentation for CHECKPOINT_DIR (#548) * Enable 
bert large multi stream inference (#554) * test bert multi stream module * enable input split and output concat for accuracy run * change the default num_streams batchsize cores to 56 * change ssd multi stream throughput to 1 core 1 batch * change the default parameter for rn50 ssd multi stream module * modify enable_ipex_for_squad.diff to align new multistream hint implementation * enable warmup and multi socket support * change default parameter for rn50 ssd multi stream inference * Add train-no-eval for rn50 pytorch (#555) * PyTorch SPR BERT large training updates (h5py and dataset instructions) and update LD_PRELOAD for SPR entrypoints (#550) * Add h5py install to bert training dockerfile * documentation updates * update docs, and add input_preprocessing to the wrapper package * Update LD_PRELOAD trailing : * Fix syntax * removing unnecessary change * Update DLRM entrypoint * Update docs to note that phase2 has bert_config.json in the CHECKPOINT_DIR * Fix syntax * increase shm-size to 10g * [RNN-T training] Update scripts -- run on 1S (#561) * Update maskrcnn training script to run on 1s (#562) * use single node to do ssd-rn34 training (#563) * Update training.sh (#564) * Update training.sh (#565) Use tcmalloc instead of jemalloc * use single node to do resnet50 training (#568) * add numactl -C and remove jit warm in main thread (#569) * Update unit-test.yml (#546) * re-enable int8 for api change (#579) * saperate fully convergency test from training test (#581) Co-authored-by: jianan-gu <[email protected]> * ssd enable new int8 (#580) * v1 * enable new int8 method * Revert "ssd enable new int8 (#580)" (#584) This reverts commit 9eb3211. * Revert "re-enable int8 for api change (#579)" (#583) This reverts commit 0bded92. 
* Update training script using 1s (#560) * Enable checkpoint during training for bert-large (#573) * minor fix * Add readme for enabling checkpoint * update phase1 to enable checkpoint by default * Update README.md * Enable ssd bf32 inference training (#589) * enable ssd bf32 inference * enable ssd bf32 train * enable RNN-T bf32 inference (#591) * Enable bf32 for bert and distilbert for inference (#593) * enable bf32 distilbert * enable bert bf32 * Enable RNN-T bf32 training (#594) * enable maskrcnn bf32 inference and training (#595) * enable resnet50 and resnext101 bf16 path (#596) * enable bert bf32 train (#600) * update resnet int8 path using new int8 api (#603) * re-enable int8 for api change (#604) Co-authored-by: jianan-gu <[email protected]> * Leslie/ssd enable new int8 (#605) * v1 * enable new int8 method * update json file * add rn50 int8 weight sharing Co-authored-by: Jiang, Xiaofei <[email protected]> * update ssd training bs to the multily of core numbers (#606) * enable bf32 for dlrm (#607) Co-authored-by: jianan-gu <[email protected]> * Update IPEX new int8 API enabling for distilbert/bert-large (#608) * enable distilbert * enable bert * fix max-ind-range and add memory info (#609) Co-authored-by: jianan-gu <[email protected]> * Remove debug code (#610) * update training steps (#611) * fix bandit scan fails (#612) * PYT Image recognition models support on Windows (#549) * fix all image recognition scripts to run on windows and linux with PYT, and only linux with IPEX * [RNN-T training] fix bandit scan fails (#614) * RNN-T inference: fix IMZ Bandit scan fails (#615) * Update unit-test.yml (#570) * MaskRCNN: fix IMZ Bandit scan fails (#623) * Fix for horovod-related failures in TF nightly runs (#613) * cpp17 horovod failure fix * minor debugging changes * minor fixes - directory name * cleanup * addressing reviewer comments * Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34 (#624) * Minor fix for Horovod install and adding 'tf_slim' 
for SSD ResNet34 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Set 'HOROVOD_WITH_MPI=1' explicitly Signed-off-by: Abolfazl Shahbazi <[email protected]> * update GCC version to GCC 9 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'horovodrun --check-build' for sanity check Signed-off-by: Abolfazl Shahbazi <[email protected]> * removo force install inside Docker Signed-off-by: Abolfazl Shahbazi <[email protected]> * [RNN-T training] Fix ddp sample number issue (#625) * update BF32 usage (#627) * resnet50 training: add warm up before collecting time (#628) * image to bf16 (#629) * Update end2end DLSA dockerfile due to SPR wheel path update and removing int8 patch (#631) * Update mlpc path for SPR wheels * remove patch * Update Horovod commit id for BareMetal, Docker will be updated next (#630) Signed-off-by: Abolfazl Shahbazi <[email protected]> * fix dlrm convergence and change training performance BS to 32K (#633) Co-authored-by: jianan-gu <[email protected]> * [RNN-T training] Merge sh files to one (#635) * update torch-ccl into 1.12 (#636) * Liangan1/update torch ccl version (#637) * Update torch_ccl version * resnet50_distributed_training: don't set MASTER_ADDR by user (#638) * Update torch_ccl in script (#639) * Enable offline download distilbert (#632) * enable offline download distilbert * add convert * Update README.md * add accuracy.py * add file * refine download * refine path * refine path * add license * Update dlrm_s_pytorch.py (#643) * Update README.md (#649) * init pytorch T5 language model (#648) * init pytorch T5 language model * update README.md * update doc * update fpn models (#650) * pytorch resnet50: directly call ipex.quantization (#653) * fix int8 accuracy (#655) Co-authored-by: Zhang, Weizhuo <[email protected]> * Made fixes to the broken links (#652) * Made fixes to the broken links * Changed the ResNet50v1_5 version back to v2_7_0 * Modified the setup AI kit instructions Co-authored-by: msalopan <[email protected]> * 
Update Security Center URL (#657) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Weizhuoz/fix for pt 1.12 (#656) * fix vgg11_bn accuracy syntax error * remove exact_match from roberta-base * modify maskrcnn BS to 2*num_cores * Update dlrm_s_pytorch.py (#660) * Update dlrm_s_pytorch.py Reduce int8 memory usage. * Update dlrm_s_pytorch.py * Update dlrm_s_pytorch.py * Update dlrm_s_pytorch.py * Update dlrm_s_pytorch.py * Add BF32 DDP for bert-large (#663) * Update run_ddp_bert_pretrain_phase1.sh * Update run_ddp_bert_pretrain_phase2.sh * Update README.md * move OMP_NUM_THREADS=1 into dlrm_s_pytorch.py (#664) minor changes * remove rn50 ao (#665) * Re-organize models list to be grouped by framework (#654) * re-organize models list to be grouped by framework * update tensorflow ssd-resnet34 training dataset * add T5 in benchmark/README.md * mannuel set torch num threads only for int8 (#666) * Update inference_performance.sh (#669) * improve ssdrn34 perf. (#671) * improve ssdrn34 perf. * minor update. * Fix linting Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix unit tests too Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * Use IPEX Pytorch whls instead of building IPEX from source (#674) Co-authored-by: Clayne Robison <[email protected]> * Lpot2inc (#446) Co-authored-by: ltsai1 <[email protected]> * Sriniva2/ssd rn34 (#682) * improve ssdrn34 perf. * minor update. * enabling synthetic data. 
* Update base_benchmark_util.py * Fix linting error Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * Add doc updates for '--synthetic-data' option (#683) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Change checkpoint setting for Bert train phase 1 (#602) * Change checkpoint setting for Bert train phase 1 * fix model and config saving * fix error when runing gpu path (#686) * fix load pretrained model error when using torch_ccl (#688) * update py version in base spec (#678) (#690) * TF addons upgrade to 0.17.1 (#689) (#691) * updated tf adons version * remove comment * Update Dockerfiles prior to IMZ 2.8 release (#693) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update Documents prior to IMZ 2.8 release (#694) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update README.md (#697) * change numpy version requirement (#703) * Remove MiniGo training from IMZ (#644) * remove MiniGo training scripts and unit test * [RNN-T] [Inference] optimize the batch decoder (#711) * reduce fill_ OP in rnnt embedding kernel * optimize add between int and log to reduce dtype conversion * rnnt: support dump tracing file and print profile table (#712) * add support for open SUSE leap operating system (#708) * rnnt inference: pre convert data to bf16 (#713) * remove squeeze/slice/transpose (#714) * update resnet50 training code (#710) * update resnet50 training code * not using ipex optimize for resnet50 training * use ipex.optimize() on the whole model (#718) * resnet50 bf32: calling ipex.optimize to enable bf32 path (#719) * Added batch size as an env variable to the quickstart scripts (#676) Co-authored-by: Clayne Robison <[email protected]> * Added batchsize as an env variable to quickstart scripts (#680) * updated readme: nit fix (#723) Co-authored-by: Rahul Nair <[email protected]> * compute throughput by test_mini_batch_size (#740) * pytorch resnet50: fix bf32 training path error (#739) * Fix a 
subtle 'E275' style issue that causes unknown behavior (#742) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * rearrange the paragraphs and fix Markdown headers (#744) * Align Transformers version for BERT models (#738) * align transformer version(4.18) for bert models * change scripts to legacy * redo calibration * patch fix * Update README.md (#746) * Add support for stock PYT- object detection models (#732) * stock PYT and windows support for object detection models * Weizhuoz/reduce model zoo steps (#762) * reduce steps for bert-base, roberta, fpn models * modify max_iter for fpn models * reduce all img classification models steps * update new config for bert models (#763) * Addin Scipy for TensorFlow serving SSD-MobileNet model (#764) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update TF ResNet50v1.5 inference for SPR (baremetal) (#749) * Added matplotlib dependency to image_segmentation requirements (#768) * Update readmes for the path to output directory (#769) * update wide & deep readme for the path to pretrained model directory (#771) * add a check for ubuntu 22.04 support (#721) * Changes to add bfloat16 support for DIEN training (#679) * Changes to add bfloat16 support for DIEN training * Some for for reporting performance * Fixes for dien training and unit tests * updated tpp file withr2.8 approvals (#773) * Add Windows stock PyTorch support for TransNet v2 (#779) * update TransNet v2 to work with stock pytorch * update Windows.md path in all relevant docs * add P99 metric for LZ models (#780) Co-authored-by: Weizhuo Zhang <[email protected]> * Rn50 training multiple epoches output 1 KPI and add training_steps argument. 
(#775) * enable --training_steps and 1 training KPI output with multiple epoches * add prefix * update print freq * fix display bug * enable PyTorch resnet50 fp16 path (#783) * enable PyTorch resnet50 fp16 path * fix conflict * Extract p99 metric from log to summary (#784) * enable fp16 bert train and inference (#782) * Vruddarr/pt update windows readmes (#778) * remove bfloat16 experimental support note (#786) * Update IPEX installation path (#788) * Clean up _pycache_ files, remove symlinks, and add license headers for dien training bf16 (#787) * update readme for jemalloc and iomp path (#789) * update readme for jemalloc and iomp path * Updated IOMP path as path to the intel-openmp directory * PyTorch: fix resnext101 running script (#795) * Update 3dunet mlperf bash scripts and README (#797) * update 3dunet mlperf doc to use quickstart scripts, rename quickstart scripts for multi-instance * fix tests job (#803) * rnnt inference: align replace lstm API due to IPEX change (#802) * Adding quick start scripts to MobileNetV1 bfloat16 precision (#793) * Adding quick start scripts to ssd-mobilenet bfloat16 precision (#798) * Update T5 model with windows quick start scripts (#790) * Update T5 model with windows quick start scripts * Updated Readme by specifying values to environment variables * Update inference int8 readme and script of 4 CV models using INC (#698) * update docs to add INC int8 models as an option * add instructions for how to quantize a fp32 model using INC * rnnt: fix stft due to PyTorch API change (#811) * rnnt training: fix stft due to PyTorch API change (#813) * Update BareMetalSetup.md (#817) * Gerardod/build container (#807) First phase of GHA WF to build the image of a Model Zoo workload container and push it to CAAS. * Sharvils/tf workload (#808) * TFv2.10 support added. Horovod version updated. 
* Vruddarr/tf add language translation bert fp32 quick start scripts (#804) * Adding quick start scripts to language translation BERT FP32 model * Updated TL notebooks for SPR Launch (#810) * Updates for TL PyTorch notebook * Edits for two more TL notebooks * Reverting previous change for virtualenv * Removed --no-deps and some nonexistent links * Added TFHub cache dir * Updated TL notebook README for legal/branding * Update typo in Readme (#821) Co-authored-by: veena.mounika.ruddarraju <[email protected]> * PyTorch: using ipex.optimize for bf16 training (#824) * Fix CVEs for Pillow and notebook packages (#831) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * add intel-alphafold2 optimized w/ IPEX from realm of AIDD (#737) * add alphafold2 from AIDD realm * Remove unused variable in mlperf 3DUnet performance run (#832) * Update Model Zoo name, Python version and message for IPEX (#833) * Update instruction for Miniconda, Jemalloc, PyTorch and IPEX and updt… (#830) * Update models main tables (#836) *update main readmes * Adding jemalloc instructions and environment variables (#838) * Add support for dGPU models (#840) * add support for dGPU support * remove spr dockerfiles and spec files (#842) * delete links to 3dunet mlperf and bert large int8 (#841) * update tbb files (#843) * fix vulnerability issues reported by snyk scans (#848) * update for new precision (#849) * upgrade for ipex 1.13 * delete workflows Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: leslie-fang-intel <[email protected]> Co-authored-by: Dina Suehiro Jones <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: XiaobingZhang <[email protected]> Co-authored-by: Xiaoming (Jason) Cui <[email protected]> Co-authored-by: jiayisunx <[email protected]> Co-authored-by: Melanie Buehler <[email protected]> Co-authored-by: Srini511 <[email protected]> Co-authored-by: Sean-Michael Riesterer 
<[email protected]> Co-authored-by: jianan-gu <[email protected]> Co-authored-by: Chunyuan WU <[email protected]> Co-authored-by: zhuhaozhe <[email protected]> Co-authored-by: Wang, Chuanqi <[email protected]> Co-authored-by: YanbingJiang <[email protected]> Co-authored-by: Weizhuo Zhang <[email protected]> Co-authored-by: xiaofeij <[email protected]> Co-authored-by: liangan1 <[email protected]> Co-authored-by: blzheng <[email protected]> Co-authored-by: Om Thakkar <[email protected]> Co-authored-by: mahathis <[email protected]> Co-authored-by: Clayne Robison <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Neo Zhang Jianyu <[email protected]> Co-authored-by: ltsai1 <[email protected]> Co-authored-by: Jitendra Patil <[email protected]> Co-authored-by: Kanvi Khanna <[email protected]> Co-authored-by: Rahul Nair <[email protected]> Co-authored-by: Veena2207 <[email protected]> Co-authored-by: jojivk-intel-nervana <[email protected]> Co-authored-by: xiangdong <[email protected]> Co-authored-by: Huang, Zhiwei <[email protected]> Co-authored-by: gera-aldama <[email protected]> Co-authored-by: Sharvil Shah <[email protected]> Co-authored-by: wyang2 <[email protected]> Co-authored-by: Yimei Sun <[email protected]>
Signed-off-by: Abolfazl Shahbazi <[email protected]>
* Update Pillow to '>=9.3.0' (#884)
  Signed-off-by: Abolfazl Shahbazi <[email protected]>
* remove supported OS checks (#926)
* Remove Linux/windows OS platform support checks (#927)
* upgrade Pillow version for Yolov4
  Signed-off-by: Abolfazl Shahbazi <[email protected]>
  Co-authored-by: Abolfazl Shahbazi <[email protected]>
…odels.intel-models
* rnnt: use launcher to set output file path and name (#524) * Update BareMetalSetup.md (#526) Always use the latest torchvision * Reduce memory usage for dlrm acc test (#527) * updatedistilbert with text_classification (#529) * add patch for distilbert (#530) * Update the model-builder dockerfile to use ubuntu 20.04 (#532) * Add script for coco training dataset processing (#525) * and update tensorflow ssd-resnet34 training dataset instructions * update patch (#533) Co-authored-by: Wang, Chuanqi <[email protected]> * [RNN-T training] Enable FP32 gemm using oneDNN (#531) * Update the Readme guide for distilbert (#534) * Update the Readme guide for distilbert * Fix accuracy grep bug, and grep accuracy for distilbert Co-authored-by: Weizhuo Zhang <[email protected]> * Update end2end public dockerfile to look for IPEX in the conda directory (#535) * Notebook to script conversion example (#516) * Add notebook script conversion example * Fixed doc * Replaces custom preprocessor with built-in one * Changed tag to remove_for_custom_dataset * Add URL check prior to calling urlretrieve (#538) * Add URL check prior to calling urlretrieve Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a typo Signed-off-by: Abolfazl Shahbazi <[email protected]> * disable for ssd since fused cat cat kernel is slow (#537) * fix bug when adding steps in rnnt inference (#528) * Fix and updates for TensorFlow WW18-2022 SPR (#542) * Fix and updates for TensorFlow WW18-2022 SPR Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix TensorFlow SPR nightly versions Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update pre-trained models download URLs Signed-off-by: Abolfazl Shahbazi <[email protected]> * Intall Python 3.8 development tools Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix OpenMPI install and setup Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519) Signed-off-by: 
Abolfazl Shahbazi <[email protected]> * Fix Horovod Installaion for SPR and CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Python3.8 version for CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a typo in TensorFlow 3d-unet partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a broken partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add TCMalloc to TF base container for SPR and remove OpenSSL Signed-off-by: Abolfazl Shahbazi <[email protected]> * Remove some repositories Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'matplotlib' for '3d-unet' Signed-off-by: Abolfazl Shahbazi <[email protected]> * switch to build OpenMPI due to issue in Market Place provided version Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix PYTORCH_WHEEL and IPEX_WHEEL arg values Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix and updates for PyTorch WW14-2022 SPR (#543) * Fix and updates for PyTorch WW14-2022 SPR Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix and updates for TensorFlow WW18-2022 SPR Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix TensorFlow SPR nightly versions Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update pre-trained models download URLs Signed-off-by: Abolfazl Shahbazi <[email protected]> * Intall Python 3.8 development tools Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix OpenMPI install and setup Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Horovod Installaion for SPR and CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Python3.8 version for CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a typo in TensorFlow 3d-unet partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a broken partial Signed-off-by: Abolfazl Shahbazi <[email 
protected]> * Add TCMalloc to TF base container for SPR and remove OpenSSL Signed-off-by: Abolfazl Shahbazi <[email protected]> * Updates required to the base image Signed-off-by: Abolfazl Shahbazi <[email protected]> * Remove some repositories Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'matplotlib' for '3d-unet' Signed-off-by: Abolfazl Shahbazi <[email protected]> * switch to build OpenMPI due to issue in Market Place provided version Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix PYTORCH_WHEEL and IPEX_WHEEL arg values Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix PYT resnet50 quickstart scripts for both Linux and Windows (#547) * fix quickstart scripts, detect platform type, update to run with pytorch only * Fix SPR PyTorch MaskRCNN inference documentation for CHECKPOINT_DIR (#548) * Enable bert large multi stream inference (#554) * test bert multi stream module * enable input split and output concat for accuracy run * change the default num_streams batchsize cores to 56 * change ssd multi stream throughput to 1 core 1 batch * change the default parameter for rn50 ssd multi stream module * modify enable_ipex_for_squad.diff to align new multistream hint implementation * enable warmup and multi socket support * change default parameter for rn50 ssd multi stream inference * Add train-no-eval for rn50 pytorch (#555) * PyTorch SPR BERT large training updates (h5py and dataset instructions) and update LD_PRELOAD for SPR entrypoints (#550) * Add h5py install to bert training dockerfile * documentation updates * update docs, and add input_preprocessing to the wrapper package * Update LD_PRELOAD trailing : * Fix syntax * removing unnecessary change * Update DLRM entrypoint * Update docs to note that phase2 has bert_config.json in the CHECKPOINT_DIR * Fix syntax * increase shm-size to 10g * [RNN-T training] Update scripts -- run on 1S (#561) * Update maskrcnn training script to run on 1s (#562) * use single node to do ssd-rn34 
training (#563) * Update training.sh (#564) * Update training.sh (#565) Use tcmalloc instead of jemalloc * use single node to do resnet50 training (#568) * add numactl -C and remove jit warm in main thread (#569) * Update unit-test.yml (#546) * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Fixed make command, updated pip install. Fixed make command to run from the root directory. Replaced pip install tox with a pip install -r requirements-tests.txt to install all dependencies for the tests. * Add tox to test dependencies. Added tox to the dependencies so that the Workflow and others may install it with pip install -r requirements-test.txt and be covered for running make lint and make unit-test. * Update unit-test.yml Changed 'make unit-test' to 'make unit_test' as that is the actual target defined in the Makefile. * Update unit-test.yml Changed apt-get install command. * re-enable int8 for api change (#579) * separate full convergence test from training test (#581) Co-authored-by: jianan-gu <[email protected]> * ssd enable new int8 (#580) * v1 * enable new int8 method * Revert "ssd enable new int8 (#580)" (#584) This reverts commit 9eb3211. * Revert "re-enable int8 for api change (#579)" (#583) This reverts commit 0bded92.
* Update training script using 1s (#560) * Enable checkpoint during training for bert-large (#573) * minor fix * Add readme for enabling checkpoint * update phase1 to enable checkpoint by default * Update README.md * Enable ssd bf32 inference training (#589) * enable ssd bf32 inference * enable ssd bf32 train * enable RNN-T bf32 inference (#591) * Enable bf32 for bert and distilbert for inference (#593) * enable bf32 distilbert * enable bert bf32 * Enable RNN-T bf32 training (#594) * enable maskrcnn bf32 inference and training (#595) * enable resnet50 and resnext101 bf16 path (#596) * enable bert bf32 train (#600) * update resnet int8 path using new int8 api (#603) * re-enable int8 for api change (#604) Co-authored-by: jianan-gu <[email protected]> * Leslie/ssd enable new int8 (#605) * v1 * enable new int8 method * update json file * add rn50 int8 weight sharing Co-authored-by: Jiang, Xiaofei <[email protected]> * update ssd training bs to the multily of core numbers (#606) * enable bf32 for dlrm (#607) Co-authored-by: jianan-gu <[email protected]> * Update IPEX new int8 API enabling for distilbert/bert-large (#608) * enable distilbert * enable bert * fix max-ind-range and add memory info (#609) Co-authored-by: jianan-gu <[email protected]> * Remove debug code (#610) * update training steps (#611) * fix bandit scan fails (#612) * PYT Image recognition models support on Windows (#549) * fix all image recognition scripts to run on windows and linux with PYT, and only linux with IPEX * [RNN-T training] fix bandit scan fails (#614) * RNN-T inference: fix IMZ Bandit scan fails (#615) * Update unit-test.yml (#570) Changed the docker user credential to utilize GitHub Secret. 
* MaskRCNN: fix IMZ Bandit scan fails (#623) * Fix for horovod-related failures in TF nightly runs (#613) * cpp17 horovod failure fix * minor debugging changes * minor fixes - directory name * cleanup * addressing reviewer comments * Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34 (#624) * Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Set 'HOROVOD_WITH_MPI=1' explicitly Signed-off-by: Abolfazl Shahbazi <[email protected]> * update GCC version to GCC 9 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'horovodrun --check-build' for sanity check Signed-off-by: Abolfazl Shahbazi <[email protected]> * removo force install inside Docker Signed-off-by: Abolfazl Shahbazi <[email protected]> * [RNN-T training] Fix ddp sample number issue (#625) * update BF32 usage (#627) * resnet50 training: add warm up before collecting time (#628) * image to bf16 (#629) * Update end2end DLSA dockerfile due to SPR wheel path update and removing int8 patch (#631) * Update mlpc path for SPR wheels * remove patch * Update Horovod commit id for BareMetal, Docker will be updated next (#630) Signed-off-by: Abolfazl Shahbazi <[email protected]> * fix dlrm convergence and change training performance BS to 32K (#633) Co-authored-by: jianan-gu <[email protected]> * [RNN-T training] Merge sh files to one (#635) * update torch-ccl into 1.12 (#636) * Liangan1/update torch ccl version (#637) * Update torch_ccl version * resnet50_distributed_training: don't set MASTER_ADDR by user (#638) * Update torch_ccl in script (#639) * Enable offline download distilbert (#632) * enable offline download distilbert * add convert * Update README.md * add accuracy.py * add file * refine download * refine path * refine path * add license * Update dlrm_s_pytorch.py (#643) * Update README.md (#649) * init pytorch T5 language model (#648) * init pytorch T5 language model * update README.md * update doc * update fpn 
models (#650) * pytorch resnet50: directly call ipex.quantization (#653) * fix int8 accuracy (#655) Co-authored-by: Zhang, Weizhuo <[email protected]> * Made fixes to the broken links (#652) * Changed the ResNet50v1_5 version back to v2_7_0 * Update Security Center URL (#657) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Weizhuoz/fix for pt 1.12 (#656) * fix vgg11_bn accuracy syntax error * remove exact_match from roberta-base * modify maskrcnn BS to 2*num_cores * Update dlrm_s_pytorch.py (#660) * Update dlrm_s_pytorch.py Reduce int8 memory usage. * Update dlrm_s_pytorch.py * Update dlrm_s_pytorch.py * Update dlrm_s_pytorch.py * Update dlrm_s_pytorch.py * Add BF32 DDP for bert-large (#663) * Update run_ddp_bert_pretrain_phase1.sh * Update run_ddp_bert_pretrain_phase2.sh * Update README.md * move OMP_NUM_THREADS=1 into dlrm_s_pytorch.py (#664) minor changes * remove rn50 ao (#665) * Re-organize models list to be grouped by framework (#654) * re-organize models list to be grouped by framework * update tensorflow ssd-resnet34 training dataset * add T5 in benchmark/README.md * mannuel set torch num threads only for int8 (#666) * Update inference_performance.sh (#669) * improve ssdrn34 perf. (#671) * improve ssdrn34 perf. * minor update. 
* Fix linting Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix unit tests too Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * Use IPEX Pytorch whls instead of building IPEX from source (#674) * Use IPEX Pytorch whls instead of building IPEX from source * Corrected the link to install pytorch/IPEX * Corrected the link to install pytorch/IPEX * Updated the link with latest tutorial to install pytorch/IPEX * Update docs/general/pytorch/BareMetalSetup.md Co-authored-by: Clayne Robison <[email protected]> * Update docs/general/pytorch/BareMetalSetup.md Co-authored-by: Clayne Robison <[email protected]> * Made the suggested tweaks in the names * Adding condition to install jemalloc and tcmalloc Co-authored-by: Clayne Robison <[email protected]> * Added condition to install jemalloc, tcmalloc, vision and torch-ccl * Added some tweaks Co-authored-by: Clayne Robison <[email protected]> Co-authored-by: root <[email protected]> * Lpot2inc (#446) * draft for lpot quantization and perf analysis jupyter notebook * update with formal name of model zoo, correct wrong words, add license in python file * rm empty line * renmae LPOT to INC in text and code, and use new api * Update README.md * Update set_env.sh * Update README.md * Update ut.sh * Update local_banchmark.sh * Create local_benchmark.sh * Update README.md * Update inc_for_tensorflow.ipynb * Update ut.sh * Update README.md * rename to local_benchmark.sh * Update ut.sh * Update ut.sh * Update run_jupyter.sh * Delete lpot_for_tensorflow.ipynb * Delete lpot_quantize_model.py * Update README.md * Update README.md * Update README.md * Update inc_for_tensorflow.ipynb * Update README.md * Update README.md * Update inc_for_tensorflow.ipynb * Update requirements.txt Co-authored-by: ltsai1 <[email protected]> * Sriniva2/ssd rn34 (#682) * improve ssdrn34 perf. * minor update. * enabling synthetic data. 
* Update base_benchmark_util.py * Fix linting error Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * Add doc updates for '--synthetic-data' option (#683) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Change checkpoint setting for Bert train phase 1 (#602) * Change checkpoint setting for Bert train phase 1 * fix model and config saving * fix error when runing gpu path (#686) * fix load pretrained model error when using torch_ccl (#688) * update py version in base spec (#678) (#690) * TF addons upgrade to 0.17.1 (#689) (#691) * updated tf adons version * remove comment * Update Dockerfiles prior to IMZ 2.8 release (#693) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update Documents prior to IMZ 2.8 release (#694) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update README.md (#697) * change numpy version requirement (#703) * Remove MiniGo training from IMZ (#644) * remove MiniGo training scripts and unit test * [RNN-T] [Inference] optimize the batch decoder (#711) * reduce fill_ OP in rnnt embedding kernel * optimize add between int and log to reduce dtype conversion * rnnt: support dump tracing file and print profile table (#712) * add support for open SUSE leap operating system (#708) * rnnt inference: pre convert data to bf16 (#713) * remove squeeze/slice/transpose (#714) * update resnet50 training code (#710) * update resnet50 training code * not using ipex optimize for resnet50 training * use ipex.optimize() on the whole model (#718) * resnet50 bf32: calling ipex.optimize to enable bf32 path (#719) * Added batch size as an env variable to the quickstart scripts (#676) * WIP: Adding batch size as an environment variable to the quickstart scripts * Added instructions in README.md for all workloads * Update README.md * Corrected typo in launch_benchmark * Made corrections to .docs and ran model-builder * Delete .README.md.swp * Delete .fp32_accuracy.sh.swp * Update 
quickstart/image_segmentation/tensorflow/3d_unet_mlperf/inference/cpu/inference_throughput.sh Co-authored-by: Clayne Robison <[email protected]> * Update quickstart/language_translation/tensorflow/transformer_mlperf/inference/cpu/inference_realtime.sh Co-authored-by: Clayne Robison <[email protected]> * Update benchmarks/launch_benchmark.py Co-authored-by: Clayne Robison <[email protected]> * Made corrections to batch-size parameter * Made changes in launch_benchmark for batch-size arg * Made modifications to the README's * Resolved merge conflict by keeping README.md file. * Modified readme for windows * Resolved merge conflict by keeping README.md file. * Corrected SPR run.sh scripts * Removed echo from run.sh Co-authored-by: Clayne Robison <[email protected]> * Added batchsize as an env variable to quickstart scripts (#680) * Added batchsize as an env variable to quickstart scripts * Made modifications to .docs and scripts * Made modifications to README * Resolved merge conflict by incorporating both suggestions. 
* Made corrections in README.md * Made corrections in README.md * Undo changes in training.sh file * updated readme: nit fix (#723) Co-authored-by: Rahul Nair <[email protected]> * compute throughput by test_mini_batch_size (#740) * pytorch resnet50: fix bf32 training path error (#739) * Fix a subtle 'E275' style issue that causes unknown behavior (#742) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * rearrange the paragraphs and fix Markdown headers (#744) * Align Transformers version for BERT models (#738) * align transformer version(4.18) for bert models * change scripts to legacy * redo calibration * patch fix * Update README.md (#746) * Add support for stock PYT- object detection models (#732) * stock PYT and windows support for object detection models * Weizhuoz/reduce model zoo steps (#762) * reduce steps for bert-base, roberta, fpn models * modify max_iter for fpn models * reduce all img classification models steps * update new config for bert models (#763) * Addin Scipy for TensorFlow serving SSD-MobileNet model (#764) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update TF ResNet50v1.5 inference for SPR (baremetal) (#749) * Added matplotlib dependency to image_segmentation requirements (#768) * Update readmes for the path to output directory (#769) * update wide & deep readme for the path to pretrained model directory (#771) * add a check for ubuntu 22.04 support (#721) * Changes to add bfloat16 support for DIEN training (#679) * Changes to add bfloat16 support for DIEN training * Some for for reporting performance * Fixes for dien training and unit tests * updated tpp file withr2.8 approvals (#773) * Add Windows stock PyTorch support for TransNet v2 (#779) * update TransNet v2 to work with stock pytorch * update Windows.md path in all relevant docs * add P99 metric for LZ models (#780) Co-authored-by: Weizhuo Zhang <[email protected]> 
* Rn50 training multiple epoches output 1 KPI and add training_steps argument. (#775) * enable --training_steps and 1 training KPI output with multiple epoches * add prefix * update print freq * fix display bug * enable PyTorch resnet50 fp16 path (#783) * enable PyTorch resnet50 fp16 path * fix conflict * Extract p99 metric from log to summary (#784) * enable fp16 bert train and inference (#782) * Vruddarr/pt update windows readmes (#778) * remove bfloat16 experimental support note (#786) * Update IPEX installation path (#788) * Clean up _pycache_ files, remove symlinks, and add license headers for dien training bf16 (#787) * update readme for jemalloc and iomp path (#789) * update readme for jemalloc and iomp path * Updated IOMP path as path to the intel-openmp directory * PyTorch: fix resnext101 running script (#795) * Update 3dunet mlperf bash scripts and README (#797) * update 3dunet mlperf doc to use quickstart scripts, rename quickstart scripts for multi-instance * fix tests job (#803) * rnnt inference: align replace lstm API due to IPEX change (#802) * Adding quick start scripts to MobileNetV1 bfloat16 precision (#793) * Adding quick start scripts to ssd-mobilenet bfloat16 precision (#798) * Update T5 model with windows quick start scripts (#790) * Update T5 model with windows quick start scripts * Updated Readme by specifying values to environment variables * Update inference int8 readme and script of 4 CV models using INC (#698) * update docs to add INC int8 models as an option * add instructions for how to quantize a fp32 model using INC * rnnt: fix stft due to PyTorch API change (#811) * rnnt training: fix stft due to PyTorch API change (#813) * Update BareMetalSetup.md (#817) * Gerardod/build container (#807) First phase of GHA WF to build the image of a Model Zoo workload container and push it to CAAS. * Sharvils/tf workload (#808) * TFv2.10 support added. Horovod version updated. 
* Vruddarr/tf add language translation bert fp32 quick start scripts (#804) * Adding quick start scripts to language translation BERT FP32 model * Changed path to the Readme * Adding spec file <bert-fp32-inference_spec.yml> * Update spec file and model link in Readme tables * Update Readme path in windows.md * Updated TL notebooks for SPR Launch (#810) * Updates for TL PyTorch notebook * Edits for two more TL notebooks * Reverting previous change for virtualenv * Removed --no-deps and some nonexistent links * Added TFHub cache dir * Updated TL notebook README for legal/branding * Update typo in Readme (#821) * PyTorch: using ipex.optimize for bf16 training (#824) * Fix CVEs for Pillow and notebook packages (#831) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * add intel-alphafold2 optimized w/ IPEX from realm of AIDD (#737) * add alphafold2 from AIDD realm * Remove unused variable in mlperf 3DUnet performance run (#832) * Update Model Zoo name, Python version and message for IPEX (#833) * Update instruction for Miniconda, Jemalloc, PyTorch and IPEX and updt… (#830) * Update instruction for Miniconda, Jemalloc, PyTorch and IPEX and updting the readme by replacing conda with Miniconda. 
* Adding comment to install torch in BareMetalSetup.md * Adding IPEX version and removing *s * Update models main tables (#836) *update main readmes * Adding jemalloc instructions and environment variables (#838) * DLRM hybrid gradient product (#814) * enable hybrid mergedembedding * Hybrid Merge embedding * refine code * Update model file * Fix data loader issue for distributed trianing * Update the print info * Fix lr issue for sparse table both 2/8 ranks get convergenced with 0.75 epochs Co-authored-by: root <[email protected]> * update the TTT evaluation method by excluding dataloader & metric evaluation (#844) Co-authored-by: Zhang, Liangang <[email protected]> * PyTorch: resnet50 distributed training using lars optimizer (#826) * modify dlrm's sklearn metric eval func to ipex's multi-thread version (#850) * modify recall/precision/f1/ap 's eval as optional (#856) * Port dataloader optimization for distributed training of dlrm (#847) * update the TTT evaluation method by excluding dataloader & metric evaluation * port dataloader optimization for distributed training of dlrm * modify dlrm's sklearn metric eval func to ipex's multi-thread version (#850) * modify recall/precision/f1/ap 's eval as optional (#856) * port dataloader optimization for distributed training of dlrm * delete local bs computation in evaluation stage * modify the TTT output name Co-authored-by: Zhang, Liangang <[email protected]> * Update horovod version to fix run time failure due to Status call (#859) * fix regression for dlrm single node training (#864) Co-authored-by: Weizhuo Zhang <[email protected]> * Update pytorch model zoo table of BF32 with landing zoo models (#865) * Added SNYK scan (#855) * Update SSD-ResNet34 code in start.sh(#862) * Add Distilbert base model for inference (Tensorflow) to model zoo (#815) * Add fp32 inference for distilbert base model * Fix Bert spec file (#873) * 1) Add torch.profiler (#871) 2) change the distributed_training.sh for dlrm to diamond cluster * 
Update Wide & Deep docs (#875) * The copy of #867(Porting evaluation iteration overlapping) (#876) * port evaluation overlapping * remove debug code * remove debug code * remove unused code * remove unused code * add resnet50 distributed training script (#879) * add resnet50 distributed training script * collect TTT Co-authored-by: XiaobingSuper <[email protected]> * reduce redundant bus traffic (#880) * Port all_to_all index overlapping with interaction and top mlp. (#878) * port all_to_all index overlapping with interaction and top mlp * fix seg fault * Add int8 support for distilbert (#823) * Add fp32 inference for distilbert base model Co-authored-by: syedshahbaaz <[email protected]> * Update DIEN inference docs & quickstart scripts (#869) * Update DIEN docs * update for spr ww42 Co-authored-by: WafaaT <[email protected]> * Update ResNet50v1.5 docs (#820) * Update and Validate ResNet50v1.5 Inference and training model for TF SPR * Update and validate docs for TF SPR Co-authored-by: WafaaT <[email protected]> * Update Wide & Deep using Large Dataset docs (#877) * Vruddarr/tf bfloat32 precision check (#893) * Update Wide and Deep Large Dataset Training Model docs (#881) * Vruddarr/tf update image recognition models docs (#816) * Update Inceptionv3,DenseNet 169, Inceptionv4, ResNet50, ResNet101, MobileNet V1 quickstart scripts and docs * Update and validate MobileNet v1 for TF SPR Co-authored-by: WafaaT <[email protected]> * Fix BFloat32 precision check code for Resnet50v1.5 training model (#894) * Update 3DUNet MLperf for SPR (#889) * Updated Bert Large SPR READMEs (#887) * Included tensorflow and keras versions * updated to downloaded bert checkpoints * Fix typos in MobilenetV1 scripts (#899) * modify time function to solve int8 benchmark issue on windows (#898) * modify time function to solve int8 benchmark issue on windows * Replace the time.time function calls to time.perf_counter to improve the time statistic resolution. 
Updated for the additional 5 models Co-authored-by: Ying <[email protected]> * Update DIEN Training docs (#882) * Adding permissions to scripts in DIEN and correcting pb file paths in README_SPR_baremetal (#901) * Adding SPR_baremetal_readme and fixing model paths in the tables (#904) * fix acc test for single node (#903) * fix acc test for single node * Update dlrm_s_pytorch.py Co-authored-by: Weizhuo Zhang <[email protected]> * commit cherry-picks from r2.9 (#900) * update tbb files (#843) * fix vulnerability issues reported by snyk scans (#848) * upgrade for ipex 1.13 * Update Pillow to '>=9.3.0' (#884) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * fix some bugs for p99 (#909) * Update tensorflow benchmarks to use latest horovod commit (#908) * Update start.sh * Update start.sh * Update to use shortened commit hash * do not convert data to bf16 while using fp32 and bf32 (#911) Co-authored-by: Weizhuo Zhang <[email protected]> * Update SSD-Resnet34 training docs for SPR task (#914) * Update SSD-Resnet34 training & docs for SPR * Vruddarr/tf update ssd mobilenet docs (#846) * Update quick start scripts and spec file to run for all precisions * Update and validate SSD-Mobilenet docs for TF SPR Co-authored-by: WafaaT <[email protected]> * fix print issue (#915) Co-authored-by: Weizhuo Zhang <[email protected]> * Update rfcn docs to use same quick start scripts (#897) * Update rfcn docs to use same quick start scripts Co-authored-by: WafaaT <[email protected]> * Sharvils/spr ssd training (#917) * Dockerfile updated * Update SSD-ResNet34 Inference docs (#866) * Update ResNet34 Inference to use same scripts & docs for all precisions * Update for SPR WW42 Co-authored-by: WafaaT <[email protected]> * Update transformer_mlperf scripts and README fro SPR WW42 (#891) Co-authored-by: Wafaa Taie <[email protected]> * Update TF models spec files for SPR WW42 (#919) * update TF models spec files for spr ww42 * update 
docker partial for tf addons version * workaround rdma config for spr (#925) * remove supported OS checks (#926) * Update Model paths in main readme (#928) * Remove Linux/windows OS platform support checks (#927) * update resnet50 distributed training script (#923) * resnet50 distributed training: use logical core for ccl (#930) * Update bert scripts to add same quick start scripts to all precisions (#910) * Update MobilenetV1 SPR docs (#931) * Update Resnet50v1_5_SPR_docs (#934) * Update SSD-Mobilenet SPR docs (#935) * Update Resenet50v1.5 inference SPR docs (#933) * Fix DIEN inference.sh script and add pretrained model env var in mobilenetv1 SPR baremetal readme (#939) * Update DIEN Inference and Training SPR docs (#937) * Update SSD-Resnet34 training SPR docs (#936) * Update SSD-Resnet34 Inference SPR docs (#938) * Update README_SPR_baremetal.md remove steps and warm_up steps env vars Co-authored-by: Wafaa Taie <[email protected]> * BERT training dockerfile fixed (#921) * BERT repo version fixed for SPR container (#920) * Update spr baremetal instructions for 3dunet, bert large and transformer mlperf (#932) * Update Transformer MLPerf inference docs for pre-trained models (#940) * Fix Language Translation BERT quickstart scripts (#941) * fix scripts to detect the number of cores * Update mlperf_gnmt docs (#945) * Updating Transformer_LT_official scripts (#913) * Add support for dGPU models (#840) (#948) * Add support for dGPU models (#840) * upgrade Pillow version for Yolov4 * Update main README.md (#947) * update main readme * edit transformer_mlperf and bert SPR docs * remove workflows * Fix CVEs based on Snyk scans in TL notebooks (#951) * fix snyk critical issues in TL jupyter notebooks * Remove INC dependency for Snyk issues (#953) * removed neuralcompressorfor to avoid vulnerability in Snyk scans * Remove pointers to BERT Large int8 docs (#952) * fix int8 model link (#958) * Fixed num_intra_threads for bfloat16 (#959) (#960) * Fixed num_intra_threads for 
bfloat16 * Modified open mpi instructions * Added kmp_blocktime for bfloat16 Co-authored-by: mahathis <[email protected]> * Fix syntax error and pythonpath in ssd-resnet34 training (#962) (#965) Co-authored-by: Veena2207 <[email protected]> * fix training bkms (#967) (#968) * fix T5 inference script (#969) --------- Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Chunyuan WU <[email protected]> Co-authored-by: XiaobingZhang <[email protected]> Co-authored-by: zhuhaozhe <[email protected]> Co-authored-by: jianan-gu <[email protected]> Co-authored-by: Dina Suehiro Jones <[email protected]> Co-authored-by: Wang, Chuanqi <[email protected]> Co-authored-by: YanbingJiang <[email protected]> Co-authored-by: Weizhuo Zhang <[email protected]> Co-authored-by: Melanie Buehler <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: leslie-fang-intel <[email protected]> Co-authored-by: xiaofeij <[email protected]> Co-authored-by: jiayisunx <[email protected]> Co-authored-by: Sean-Michael Riesterer <[email protected]> Co-authored-by: liangan1 <[email protected]> Co-authored-by: blzheng <[email protected]> Co-authored-by: Om Thakkar <[email protected]> Co-authored-by: mahathis <[email protected]> Co-authored-by: Srini511 <[email protected]> Co-authored-by: Clayne Robison <[email protected]> Co-authored-by: Neo Zhang Jianyu <[email protected]> Co-authored-by: ltsai1 <[email protected]> Co-authored-by: Jitendra Patil <[email protected]> Co-authored-by: Kanvi Khanna <[email protected]> Co-authored-by: Rahul Nair <[email protected]> Co-authored-by: Veena2207 <[email protected]> Co-authored-by: jojivk-intel-nervana <[email protected]> Co-authored-by: xiangdong <[email protected]> Co-authored-by: Huang, Zhiwei <[email protected]> Co-authored-by: gera-aldama <[email protected]> Co-authored-by: Sharvil Shah <[email protected]> Co-authored-by: wyang2 <[email protected]> Co-authored-by: Yimei Sun <[email protected]> Co-authored-by: 
root <[email protected]> Co-authored-by: tangleintel <[email protected]> Co-authored-by: Syed Shahbaaz Ahmed <[email protected]> Co-authored-by: Er-Xin (Edwin) Shang <[email protected]> Co-authored-by: Ying <[email protected]> Co-authored-by: sevdeawesome <[email protected]> Co-authored-by: DiweiSun <[email protected]>
* [RNN-T training] Enable FP32 gemm using oneDNN (#531) * Update the Readme guide for distilbert (#534) * Update the Readme guide for distilbert * Fix accuracy grep bug, and grep accuracy for distilbert Co-authored-by: Weizhuo Zhang <[email protected]> * Update end2end public dockerfile to look for IPEX in the conda directory (#535) * Notebook to script conversion example (#516) * Add notebook script conversion example * Fixed doc * Replaces custom preprocessor with built-in one * Changed tag to remove_for_custom_dataset * Add URL check prior to calling urlretrieve (#538) * Add URL check prior to calling urlretrieve Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a typo Signed-off-by: Abolfazl Shahbazi <[email protected]> * disable for ssd since fused cat cat kernel is slow (#537) * fix bug when adding steps in rnnt inference (#528) * Fix and updates for TensorFlow WW18-2022 SPR (#542) * Fix and updates for TensorFlow WW18-2022 SPR Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix TensorFlow SPR nightly versions Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update pre-trained models download URLs Signed-off-by: Abolfazl Shahbazi <[email protected]> * Intall Python 3.8 development tools Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix OpenMPI install and setup Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Horovod Installaion for SPR and CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Python3.8 version for CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a typo in TensorFlow 3d-unet partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a broken partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add TCMalloc to TF base container for SPR and remove OpenSSL Signed-off-by: Abolfazl Shahbazi <[email protected]> * Remove some 
repositories Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'matplotlib' for '3d-unet' Signed-off-by: Abolfazl Shahbazi <[email protected]> * switch to build OpenMPI due to issue in Market Place provided version Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix PYTORCH_WHEEL and IPEX_WHEEL arg values Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix and updates for PyTorch WW14-2022 SPR (#543) * Fix and updates for PyTorch WW14-2022 SPR Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix and updates for TensorFlow WW18-2022 SPR Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix TensorFlow SPR nightly versions Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update pre-trained models download URLs Signed-off-by: Abolfazl Shahbazi <[email protected]> * Intall Python 3.8 development tools Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix OpenMPI install and setup Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Horovod Installaion for SPR and CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Python3.8 version for CentOS Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a typo in TensorFlow 3d-unet partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a broken partial Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add TCMalloc to TF base container for SPR and remove OpenSSL Signed-off-by: Abolfazl Shahbazi <[email protected]> * Updates required to the base image Signed-off-by: Abolfazl Shahbazi <[email protected]> * Remove some repositories Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'matplotlib' for '3d-unet' Signed-off-by: Abolfazl Shahbazi <[email protected]> * switch to build OpenMPI due to issue in Market Place provided version Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix PYTORCH_WHEEL 
and IPEX_WHEEL arg values Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix PYT resnet50 quickstart scripts for both Linux and Windows (#547) * fix quickstart scripts, detect platform type, update to run with pytorch only * Fix SPR PyTorch MaskRCNN inference documentation for CHECKPOINT_DIR (#548) * Enable bert large multi stream inference (#554) * test bert multi stream module * enable input split and output concat for accuracy run * change the default num_streams batchsize cores to 56 * change ssd multi stream throughput to 1 core 1 batch * change the default parameter for rn50 ssd multi stream module * modify enable_ipex_for_squad.diff to align new multistream hint implementation * enable warmup and multi socket support * change default parameter for rn50 ssd multi stream inference * Add train-no-eval for rn50 pytorch (#555) * PyTorch SPR BERT large training updates (h5py and dataset instructions) and update LD_PRELOAD for SPR entrypoints (#550) * Add h5py install to bert training dockerfile * documentation updates * update docs, and add input_preprocessing to the wrapper package * Update LD_PRELOAD trailing : * Fix syntax * removing unnecessary change * Update DLRM entrypoint * Update docs to note that phase2 has bert_config.json in the CHECKPOINT_DIR * Fix syntax * increase shm-size to 10g * [RNN-T training] Update scripts -- run on 1S (#561) * Update maskrcnn training script to run on 1s (#562) * use single node to do ssd-rn34 training (#563) * Update training.sh (#564) * Update training.sh (#565) Use tcmalloc instead of jemalloc * use single node to do resnet50 training (#568) * add numactl -C and remove jit warm in main thread (#569) * Update unit-test.yml (#546) * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Update unit-test.yml * Fixed make command, updated pip install. 
Fixed make command to run from the root directory. Replaced pip install tox with a pip install -r requirements-tests.txt to install all dependencies for the tests. * Add tox to test dependencies. Added tox to the dependencies so that the Workflow and others may install it with pip install -r requirements-test.txt and be covered for running make lint and make unit-test. * Update unit-test.yml Changed 'make unit-test' to 'make unit_test' as that is the actual target defined in the Makefile. * Update unit-test.yml Changed apt-get install command. * re-enable int8 for api change (#579) * saperate fully convergency test from training test (#581) Co-authored-by: jianan-gu <[email protected]> * ssd enable new int8 (#580) * v1 * enable new int8 method * Revert "ssd enable new int8 (#580)" (#584) This reverts commit 9eb3211. * Revert "re-enable int8 for api change (#579)" (#583) This reverts commit 0bded92. * Update training script using 1s (#560) * Enable checkpoint during training for bert-large (#573) * minor fix * Add readme for enabling checkpoint * update phase1 to enable checkpoint by default * Update README.md * Enable ssd bf32 inference training (#589) * enable ssd bf32 inference * enable ssd bf32 train * enable RNN-T bf32 inference (#591) * Enable bf32 for bert and distilbert for inference (#593) * enable bf32 distilbert * enable bert bf32 * Enable RNN-T bf32 training (#594) * enable maskrcnn bf32 inference and training (#595) * enable resnet50 and resnext101 bf16 path (#596) * enable bert bf32 train (#600) * update resnet int8 path using new int8 api (#603) * re-enable int8 for api change (#604) Co-authored-by: jianan-gu <[email protected]> * Leslie/ssd enable new int8 (#605) * v1 * enable new int8 method * update json file * add rn50 int8 weight sharing Co-authored-by: Jiang, Xiaofei <[email protected]> * update ssd training bs to the multily of core numbers (#606) * enable bf32 for dlrm (#607) Co-authored-by: jianan-gu <[email protected]> * Update IPEX new int8 
API enabling for distilbert/bert-large (#608) * enable distilbert * enable bert * fix max-ind-range and add memory info (#609) Co-authored-by: jianan-gu <[email protected]> * Remove debug code (#610) * update training steps (#611) * fix bandit scan fails (#612) * PYT Image recognition models support on Windows (#549) * fix all image recognition scripts to run on windows and linux with PYT, and only linux with IPEX * [RNN-T training] fix bandit scan fails (#614) * RNN-T inference: fix IMZ Bandit scan fails (#615) * Update unit-test.yml (#570) Changed the docker user credential to utilize GitHub Secret. * MaskRCNN: fix IMZ Bandit scan fails (#623) * Fix for horovod-related failures in TF nightly runs (#613) * cpp17 horovod failure fix * minor debugging changes * minor fixes - directory name * cleanup * addressing reviewer comments * Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34 (#624) * Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Set 'HOROVOD_WITH_MPI=1' explicitly Signed-off-by: Abolfazl Shahbazi <[email protected]> * update GCC version to GCC 9 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'horovodrun --check-build' for sanity check Signed-off-by: Abolfazl Shahbazi <[email protected]> * removo force install inside Docker Signed-off-by: Abolfazl Shahbazi <[email protected]> * [RNN-T training] Fix ddp sample number issue (#625) * update BF32 usage (#627) * resnet50 training: add warm up before collecting time (#628) * image to bf16 (#629) * Update end2end DLSA dockerfile due to SPR wheel path update and removing int8 patch (#631) * Update mlpc path for SPR wheels * remove patch * Update Horovod commit id for BareMetal, Docker will be updated next (#630) Signed-off-by: Abolfazl Shahbazi <[email protected]> * fix dlrm convergence and change training performance BS to 32K (#633) Co-authored-by: jianan-gu <[email protected]> * [RNN-T training] Merge sh files 
to one (#635) * update torch-ccl into 1.12 (#636) * Liangan1/update torch ccl version (#637) * Update torch_ccl version * resnet50_distributed_training: don't set MASTER_ADDR by user (#638) * Update torch_ccl in script (#639) * Enable offline download distilbert (#632) * enable offline download distilbert * add convert * Update README.md * add accuracy.py * add file * refine download * refine path * refine path * add license * Update dlrm_s_pytorch.py (#643) * Update README.md (#649) * init pytorch T5 language model (#648) * init pytorch T5 language model * update README.md * update doc * update fpn models (#650) * pytorch resnet50: directly call ipex.quantization (#653) * fix int8 accuracy (#655) Co-authored-by: Zhang, Weizhuo <[email protected]> * Made fixes to the broken links (#652) * Update Security Center URL (#657) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Weizhuoz/fix for pt 1.12 (#656) * fix vgg11_bn accuracy syntax error * remove exact_match from roberta-base * modify maskrcnn BS to 2*num_cores * Update dlrm_s_pytorch.py (#660) * Update dlrm_s_pytorch.py Reduce int8 memory usage. * Update dlrm_s_pytorch.py * Update dlrm_s_pytorch.py * Update dlrm_s_pytorch.py * Update dlrm_s_pytorch.py * Add BF32 DDP for bert-large (#663) * Update run_ddp_bert_pretrain_phase1.sh * Update run_ddp_bert_pretrain_phase2.sh * Update README.md * move OMP_NUM_THREADS=1 into dlrm_s_pytorch.py (#664) minor changes * remove rn50 ao (#665) * Re-organize models list to be grouped by framework (#654) * re-organize models list to be grouped by framework * update tensorflow ssd-resnet34 training dataset * add T5 in benchmark/README.md * mannuel set torch num threads only for int8 (#666) * Update inference_performance.sh (#669) * improve ssdrn34 perf. (#671) * improve ssdrn34 perf. * minor update. 
* Fix linting Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix unit tests too Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * Use IPEX Pytorch whls instead of building IPEX from source (#674) * Use IPEX Pytorch whls instead of building IPEX from source * Corrected the link to install pytorch/IPEX * Corrected the link to install pytorch/IPEX * Updated the link with latest tutorial to install pytorch/IPEX * Update docs/general/pytorch/BareMetalSetup.md Co-authored-by: Clayne Robison <[email protected]> * Update docs/general/pytorch/BareMetalSetup.md Co-authored-by: Clayne Robison <[email protected]> * Made the suggested tweaks in the names * Adding condition to install jemalloc and tcmalloc Co-authored-by: Clayne Robison <[email protected]> * Added condition to install jemalloc, tcmalloc, vision and torch-ccl * Added some tweaks Co-authored-by: Clayne Robison <[email protected]> * Lpot2inc (#446) * draft for lpot quantization and perf analysis jupyter notebook * update with formal name of model zoo, correct wrong words, add license in python file * rm empty line * renmae LPOT to INC in text and code, and use new api * Update README.md * Update set_env.sh * Update README.md * Update ut.sh * Update local_banchmark.sh * Create local_benchmark.sh * Update README.md * Update inc_for_tensorflow.ipynb * Update ut.sh * Update README.md * rename to local_benchmark.sh * Update ut.sh * Update ut.sh * Update run_jupyter.sh * Delete lpot_for_tensorflow.ipynb * Delete lpot_quantize_model.py * Update README.md * Update README.md * Update README.md * Update inc_for_tensorflow.ipynb * Update README.md * Update README.md * Update inc_for_tensorflow.ipynb * Update requirements.txt Co-authored-by: ltsai1 <[email protected]> * Sriniva2/ssd rn34 (#682) * improve ssdrn34 perf. * minor update. * enabling synthetic data. 
* Update base_benchmark_util.py * Fix linting error Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * Add doc updates for '--synthetic-data' option (#683) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Change checkpoint setting for Bert train phase 1 (#602) * Change checkpoint setting for Bert train phase 1 * fix model and config saving * fix error when runing gpu path (#686) * fix load pretrained model error when using torch_ccl (#688) * update py version in base spec (#678) (#690) * TF addons upgrade to 0.17.1 (#689) (#691) * updated tf adons version * remove comment * Update Dockerfiles prior to IMZ 2.8 release (#693) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update Documents prior to IMZ 2.8 release (#694) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update README.md (#697) * change numpy version requirement (#703) * Remove MiniGo training from IMZ (#644) * remove MiniGo training scripts and unit test * [RNN-T] [Inference] optimize the batch decoder (#711) * reduce fill_ OP in rnnt embedding kernel * optimize add between int and log to reduce dtype conversion * rnnt: support dump tracing file and print profile table (#712) * add support for open SUSE leap operating system (#708) * rnnt inference: pre convert data to bf16 (#713) * remove squeeze/slice/transpose (#714) * update resnet50 training code (#710) * update resnet50 training code * not using ipex optimize for resnet50 training * use ipex.optimize() on the whole model (#718) * resnet50 bf32: calling ipex.optimize to enable bf32 path (#719) * Added batch size as an env variable to the quickstart scripts (#676) * WIP: Adding batch size as an environment variable to the quickstart scripts * Added instructions in README.md for all workloads * Update README.md * Corrected typo in launch_benchmark * Made corrections to .docs and ran model-builder * Delete .README.md.swp * Delete .fp32_accuracy.sh.swp * Update 
quickstart/image_segmentation/tensorflow/3d_unet_mlperf/inference/cpu/inference_throughput.sh Co-authored-by: Clayne Robison <[email protected]> * Update quickstart/language_translation/tensorflow/transformer_mlperf/inference/cpu/inference_realtime.sh Co-authored-by: Clayne Robison <[email protected]> * Update benchmarks/launch_benchmark.py Co-authored-by: Clayne Robison <[email protected]> * Made corrections to batch-size parameter * Made changes in launch_benchmark for batch-size arg * Made modifications to the README's * Resolved merge conflict by keeping README.md file. * Modified readme for windows * Resolved merge conflict by keeping README.md file. * Corrected SPR run.sh scripts * Removed echo from run.sh Co-authored-by: Clayne Robison <[email protected]> * Added batchsize as an env variable to quickstart scripts (#680) * Added batchsize as an env variable to quickstart scripts * Made modifications to .docs and scripts * Made modifications to README * Resolved merge conflict by incorporating both suggestions. 
* Made corrections in README.md * Made corrections in README.md * Undo changes in training.sh file * updated readme: nit fix (#723) Co-authored-by: Rahul Nair <[email protected]> * compute throughput by test_mini_batch_size (#740) * pytorch resnet50: fix bf32 training path error (#739) * Fix a subtle 'E275' style issue that causes unknown behavior (#742) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * rearrange the paragraphs and fix Markdown headers (#744) * Align Transformers version for BERT models (#738) * align transformer version(4.18) for bert models * change scripts to legacy * redo calibration * patch fix * Update README.md (#746) * Add support for stock PYT- object detection models (#732) * stock PYT and windows support for object detection models * Weizhuoz/reduce model zoo steps (#762) * reduce steps for bert-base, roberta, fpn models * modify max_iter for fpn models * reduce all img classification models steps * update new config for bert models (#763) * Addin Scipy for TensorFlow serving SSD-MobileNet model (#764) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update TF ResNet50v1.5 inference for SPR (baremetal) (#749) * Added matplotlib dependency to image_segmentation requirements (#768) * Update readmes for the path to output directory (#769) * update wide & deep readme for the path to pretrained model directory (#771) * add a check for ubuntu 22.04 support (#721) * Changes to add bfloat16 support for DIEN training (#679) * Changes to add bfloat16 support for DIEN training * Some for for reporting performance * Fixes for dien training and unit tests * updated tpp file withr2.8 approvals (#773) * Add Windows stock PyTorch support for TransNet v2 (#779) * update TransNet v2 to work with stock pytorch * update Windows.md path in all relevant docs * add P99 metric for LZ models (#780) Co-authored-by: Weizhuo Zhang <[email protected]> 
* Rn50 training multiple epoches output 1 KPI and add training_steps argument. (#775) * enable --training_steps and 1 training KPI output with multiple epoches * add prefix * update print freq * fix display bug * enable PyTorch resnet50 fp16 path (#783) * enable PyTorch resnet50 fp16 path * fix conflict * Extract p99 metric from log to summary (#784) * enable fp16 bert train and inference (#782) * Vruddarr/pt update windows readmes (#778) * remove bfloat16 experimental support note (#786) * Update IPEX installation path (#788) * Clean up _pycache_ files, remove symlinks, and add license headers for dien training bf16 (#787) * update readme for jemalloc and iomp path (#789) * update readme for jemalloc and iomp path * Updated IOMP path as path to the intel-openmp directory * PyTorch: fix resnext101 running script (#795) * Update 3dunet mlperf bash scripts and README (#797) * update 3dunet mlperf doc to use quickstart scripts, rename quickstart scripts for multi-instance * fix tests job (#803) * rnnt inference: align replace lstm API due to IPEX change (#802) * Adding quick start scripts to MobileNetV1 bfloat16 precision (#793) * Adding quick start scripts to MobileNetV1 bfloat16 precision * Adding executable permissions to files * Adding aikit.md to docs file * updated the comments on readme Co-authored-by: veena.mounika.ruddarraju <[email protected]> * Adding quick start scripts to ssd-mobilenet bfloat16 precision (#798) * Adding quick start scripts to ssd-mobilenet bfloat16 precision * changed file permissions * Updated comments on readme file Co-authored-by: veena.mounika.ruddarraju <[email protected]> * Update T5 model with windows quick start scripts (#790) * Update T5 model with windows quick start scripts * Updated Readme by specifying values to environment variables * Update inference int8 readme and script of 4 CV models using INC (#698) * update docs to add INC int8 models as an option * add instructions for how to quantize a fp32 model using INC * rnnt: 
fix stft due to PyTorch API change (#811) * rnnt training: fix stft due to PyTorch API change (#813) * Update BareMetalSetup.md (#817) * Gerardod/build container (#807) First phase of GHA WF to build the image of a Model Zoo workload container and push it to CAAS. * Sharvils/tf workload (#808) * TFv2.10 support added. Horovod version updated. * Vruddarr/tf add language translation bert fp32 quick start scripts (#804) * Adding quick start scripts to language translation BERT FP32 model * Corrected typo errors * Changed path to the Readme * Adding spec file <bert-fp32-inference_spec.yml> * Update spec file and model link in Readme tables * Update Readme path in windows.md * Updated TL notebooks for SPR Launch (#810) * Updates for TL PyTorch notebook * Edits for two more TL notebooks * Reverting previous change for virtualenv * Removed --no-deps and some nonexistent links * Added TFHub cache dir * Updated TL notebook README for legal/branding * Update typo in Readme (#821) * PyTorch: using ipex.optimize for bf16 training (#824) * Fix CVEs for Pillow and notebook packages (#831) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * add intel-alphafold2 optimized w/ IPEX from realm of AIDD (#737) * add alphafold2 from AIDD realm * Remove unused variable in mlperf 3DUnet performance run (#832) * Update Model Zoo name, Python version and message for IPEX (#833) Co-authored-by: veena.mounika.ruddarraju <[email protected]> * Update instruction for Miniconda, Jemalloc, PyTorch and IPEX and updt… (#830) * Update instruction for Miniconda, Jemalloc, PyTorch and IPEX and updting the readme by replacing conda with Miniconda. 
* Adding comment to install torch in BareMetalSetup.md * Update models main tables (#836) *update main readmes * Adding jemalloc instructions and environment variables (#838) * DLRM hybrid gradient product (#814) * enable hybrid mergedembedding * Hybrid Merge embedding * refine code * Update model file * Fix data loader issue for distributed trianing * Update the print info * Fix lr issue for sparse table both 2/8 ranks get convergenced with 0.75 epochs Co-authored-by: root <[email protected]> * update the TTT evaluation method by excluding dataloader & metric evaluation (#844) Co-authored-by: Zhang, Liangang <[email protected]> * PyTorch: resnet50 distributed training using lars optimizer (#826) * modify dlrm's sklearn metric eval func to ipex's multi-thread version (#850) * modify recall/precision/f1/ap 's eval as optional (#856) * Port dataloader optimization for distributed training of dlrm (#847) * update the TTT evaluation method by excluding dataloader & metric evaluation * port dataloader optimization for distributed training of dlrm * modify dlrm's sklearn metric eval func to ipex's multi-thread version (#850) * modify recall/precision/f1/ap 's eval as optional (#856) * port dataloader optimization for distributed training of dlrm * delete local bs computation in evaluation stage * modify the TTT output name Co-authored-by: Zhang, Liangang <[email protected]> * Update horovod version to fix run time failure due to Status call (#859) * fix regression for dlrm single node training (#864) Co-authored-by: Weizhuo Zhang <[email protected]> * Update pytorch model zoo table of BF32 with landing zoo models (#865) * Added SNYK scan (#855) * Update SSD-ResNet34 code in start.sh(#862) * Add Distilbert base model for inference (Tensorflow) to model zoo (#815) * Add fp32 inference for distilbert base model * Fix Bert spec file (#873) * 1) Add torch.profiler (#871) 2) change the distributed_training.sh for dlrm to diamond cluster * Update Wide & Deep docs (#875) * The 
copy of #867(Porting evaluation iteration overlapping) (#876) * port evaluation overlapping * remove debug code * remove debug code * remove unused code * remove unused code * add resnet50 distributed training script (#879) * add resnet50 distributed training script * collect TTT Co-authored-by: XiaobingSuper <[email protected]> * reduce redundant bus traffic (#880) * Port all_to_all index overlapping with interaction and top mlp. (#878) * port all_to_all index overlapping with interaction and top mlp * fix seg fault * Add int8 support for distilbert (#823) * Add fp32 inference for distilbert base model Co-authored-by: syedshahbaaz <[email protected]> * Update DIEN inference docs & quickstart scripts (#869) * Update DIEN docs * update for spr ww42 Co-authored-by: WafaaT <[email protected]> * Update ResNet50v1.5 docs (#820) * Update and Validate ResNet50v1.5 Inference and training model for TF SPR * Update and validate docs for TF SPR Co-authored-by: WafaaT <[email protected]> * Update Wide & Deep using Large Dataset docs (#877) * Vruddarr/tf bfloat32 precision check (#893) * Update Wide and Deep Large Dataset Training Model docs (#881) * Vruddarr/tf update image recognition models docs (#816) * Update Inceptionv3,DenseNet 169, Inceptionv4, ResNet50, ResNet101, MobileNet V1 quickstart scripts and docs * Update and validate MobileNet v1 for TF SPR Co-authored-by: WafaaT <[email protected]> * Fix BFloat32 precision check code for Resnet50v1.5 training model (#894) * Update 3DUNet MLperf for SPR (#889) * Updated Bert Large SPR READMEs (#887) * Updated Bert Large SPR READMEs * Included tensorflow and keras versions * Updated bert large README for spr * Updated scripts and README as per reviews * Update SPR quickstart description * updated to downloaded bert checkpoints * Fix typos in MobilenetV1 scripts (#899) * modify time function to solve int8 benchmark issue on windows (#898) * modify time function to solve int8 benchmark issue on windows * Replace the time.time 
function calls to time.perf_counter to improve the time statistic resolution. Updated for the additional 5 models Co-authored-by: Ying <[email protected]> * Update DIEN Training docs (#882) * Adding permissions to scripts in DIEN and correcting pb file paths in README_SPR_baremetal (#901) * Adding SPR_baremetal_readme and fixing model paths in the tables (#904) * fix acc test for single node (#903) * fix acc test for single node * Update dlrm_s_pytorch.py Co-authored-by: Weizhuo Zhang <[email protected]> * commit cherry-picks from r2.9 (#900) * update tbb files (#843) * fix vulnerability issues reported by snyk scans (#848) * upgrade for ipex 1.13 * Update Pillow to '>=9.3.0' (#884) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * fix some bugs for p99 (#909) * Update tensorflow benchmarks to use latest horovod commit (#908) * Update start.sh * Update start.sh * Update to use shortened commit hash * do not convert data to bf16 while using fp32 and bf32 (#911) Co-authored-by: Weizhuo Zhang <[email protected]> * Update SSD-Resnet34 training docs for SPR task (#914) * Update SSD-Resnet34 training & docs for SPR * Vruddarr/tf update ssd mobilenet docs (#846) * Update quick start scripts and spec file to run for all precisions * Update and validate SSD-Mobilenet docs for TF SPR Co-authored-by: WafaaT <[email protected]> * fix print issue (#915) Co-authored-by: Weizhuo Zhang <[email protected]> * Update rfcn docs to use same quick start scripts (#897) * Update rfcn docs to use same quick start scripts Co-authored-by: WafaaT <[email protected]> * Sharvils/spr ssd training (#917) * Dockerfile updated * Update SSD-ResNet34 Inference docs (#866) * Update ResNet34 Inference to use same scripts & docs for all precisions * Update for SPR WW42 Co-authored-by: WafaaT <[email protected]> * Update transformer_mlperf scripts and README fro SPR WW42 (#891) Co-authored-by: Wafaa Taie <[email protected]> * Update TF models spec 
files for SPR WW42 (#919) * update TF models spec files for spr ww42 * update docker partial for tf addons version * workaround rdma config for spr (#925) * remove supported OS checks (#926) * Update Model paths in main readme (#928) * Remove Linux/windows OS platform support checks (#927) * update resnet50 distributed training script (#923) * resnet50 distributed training: use logical core for ccl (#930) * Update bert scripts to add same quick start scripts to all precisions (#910) * Update MobilenetV1 SPR docs (#931) * Update Resnet50v1_5_SPR_docs (#934) * Update SSD-Mobilenet SPR docs (#935) * Update Resenet50v1.5 inference SPR docs (#933) * Fix DIEN inference.sh script and add pretrained model env var in mobilenetv1 SPR baremetal readme (#939) * Update DIEN Inference and Training SPR docs (#937) * Update SSD-Resnet34 training SPR docs (#936) * Update SSD-Resnet34 Inference SPR docs (#938) * Update README_SPR_baremetal.md remove steps and warm_up steps env vars Co-authored-by: Wafaa Taie <[email protected]> * BERT training dockerfile fixed (#921) * BERT repo version fixed for SPR container (#920) * Update spr baremetal instructions for 3dunet, bert large and transformer mlperf (#932) * Update Transformer MLPerf inference docs for pre-trained models (#940) * Fix Language Translation BERT quickstart scripts (#941) * fix scripts to detect the number of cores * Update mlperf_gnmt docs (#945) * Updating Transformer_LT_official scripts (#913) * Add support for dGPU models (#840) (#948) * Add support for dGPU models (#840) * upgrade Pillow version for Yolov4 * Update main README.md (#947) * update main readme * edit transformer_mlperf and bert SPR docs * remove workflows * Fix CVEs based on Snyk scans in TL notebooks (#951) * fix snyk critical issues in TL jupyter notebooks * Remove INC dependency for Snyk issues (#953) * removed neuralcompressorfor to avoid vulnerability in Snyk scans * Remove pointers to BERT Large int8 docs (#952) * fix int8 model link (#958) * 
Fixed num_intra_threads for bfloat16 (#959) (#960) * Fixed num_intra_threads for bfloat16 * Modified open mpi instructions * Added kmp_blocktime for bfloat16 Co-authored-by: mahathis <[email protected]> * Fix syntax error and pythonpath in ssd-resnet34 training (#962) (#965) Co-authored-by: Veena2207 <[email protected]> * fix training bkms (#967) (#968) * fix T5 inference script (#969) * Fix resnet50v1.5 weightsharing for int8 (#996) * Corrected typo in SPR quickstart scripts (#991) * fix model_init for int8 weightsharing --------- Co-authored-by: mahathis <[email protected]> * TF SPR DevCatalog READMEs (#983) * add image recognition devcats * add tf object detection devcats * add TF language translation devcats * add tf image segmentation devcats * add tf language modeling devcats * add recommendation tf devcats * fix swapped containers and precision in run command * add README_SPR to all getting started links and correct script names * rename files and point getting started to itself * fix last link * fix minor error (#994) * Update TF SPR ww42 containers partials, spec-files and dockerfiles (#998) TF SPR Containers Built and Validated * Sharvils/tf devcats fixes (#995) Minor fixes to SPR TF DevCatalogs --------- Co-authored-by: sharvil.shah * SPR PyTorch DevCatalogs (#993) Added Devcatalog files targeting SPR container launch * Delete SPR containers README_SPR.md (#999) * delete README_SPR.md * remove references in spec-files * fix for auto-merge --------- Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: YanbingJiang <[email protected]> Co-authored-by: jianan-gu <[email protected]> Co-authored-by: Weizhuo Zhang <[email protected]> Co-authored-by: Dina Suehiro Jones <[email protected]> Co-authored-by: Melanie Buehler <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: leslie-fang-intel <[email protected]> Co-authored-by: xiaofeij <[email protected]> Co-authored-by: jiayisunx <[email protected]> 
Co-authored-by: zhuhaozhe <[email protected]> Co-authored-by: XiaobingZhang <[email protected]> Co-authored-by: Sean-Michael Riesterer <[email protected]> Co-authored-by: liangan1 <[email protected]> Co-authored-by: Chunyuan WU <[email protected]> Co-authored-by: blzheng <[email protected]> Co-authored-by: Om Thakkar <[email protected]> Co-authored-by: mahathis <[email protected]> Co-authored-by: Srini511 <[email protected]> Co-authored-by: Clayne Robison <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Neo Zhang Jianyu <[email protected]> Co-authored-by: ltsai1 <[email protected]> Co-authored-by: Jitendra Patil <[email protected]> Co-authored-by: Kanvi Khanna <[email protected]> Co-authored-by: Rahul Nair <[email protected]> Co-authored-by: Veena2207 <[email protected]> Co-authored-by: jojivk-intel-nervana <[email protected]> Co-authored-by: xiangdong <[email protected]> Co-authored-by: Huang, Zhiwei <[email protected]> Co-authored-by: gera-aldama <[email protected]> Co-authored-by: Sharvil Shah <[email protected]> Co-authored-by: wyang2 <[email protected]> Co-authored-by: Yimei Sun <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: tangleintel <[email protected]> Co-authored-by: Syed Shahbaaz Ahmed <[email protected]> Co-authored-by: Er-Xin (Edwin) Shang <[email protected]> Co-authored-by: Ying <[email protected]> Co-authored-by: sevdeawesome <[email protected]> Co-authored-by: DiweiSun <[email protected]> Co-authored-by: Tyler Titsworth <[email protected]> Co-authored-by: Srikanth Ramakrishna <[email protected]>
* Adding README files for Intel® Data Center Flex Series GPUs (intel#125)
* fix incorrect links (intel#127)
* bump ipython to fix CVE (intel#128)

---------

Signed-off-by: WafaaT <[email protected]>
Co-authored-by: Clayne Robison <[email protected]>
* Use IPEX Pytorch whls instead of building IPEX from source (#674) * Lpot2inc (#446) * draft for lpot quantization and perf analysis jupyter notebook Co-authored-by: ltsai1 <[email protected]> * Sriniva2/ssd rn34 (#682) * improve ssdrn34 perf. * minor update. * enabling synthetic data. * Update base_benchmark_util.py * Fix linting error Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * Add doc updates for '--synthetic-data' option (#683) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Change checkpoint setting for Bert train phase 1 (#602) * Change checkpoint setting for Bert train phase 1 * fix model and config saving * fix error when runing gpu path (#686) * fix load pretrained model error when using torch_ccl (#688) * update py version in base spec (#678) (#690) * TF addons upgrade to 0.17.1 (#689) (#691) * updated tf adons version * remove comment * Update Dockerfiles prior to IMZ 2.8 release (#693) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update Documents prior to IMZ 2.8 release (#694) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update README.md (#697) * change numpy version requirement (#703) * Remove MiniGo training from IMZ (#644) * remove MiniGo training scripts and unit test * [RNN-T] [Inference] optimize the batch decoder (#711) * reduce fill_ OP in rnnt embedding kernel * optimize add between int and log to reduce dtype conversion * rnnt: support dump tracing file and print profile table (#712) * add support for open SUSE leap operating system (#708) * rnnt inference: pre convert data to bf16 (#713) * remove squeeze/slice/transpose (#714) * update resnet50 training code (#710) * update resnet50 training code * not using ipex optimize for resnet50 training * use ipex.optimize() on the whole model (#718) * resnet50 bf32: calling ipex.optimize to enable bf32 path (#719) * updated readme: nit fix (#723) Co-authored-by: Rahul Nair <[email protected]> * compute 
throughput by test_mini_batch_size (#740) * pytorch resnet50: fix bf32 training path error (#739) * Fix a subtle 'E275' style issue that causes unknown behavior (#742) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * rearrange the paragraphs and fix Markdown headers (#744) * Align Transformers version for BERT models (#738) * align transformer version(4.18) for bert models * change scripts to legacy * redo calibration * patch fix * Update README.md (#746) * Add support for stock PYT- object detection models (#732) * stock PYT and windows support for object detection models * Weizhuoz/reduce model zoo steps (#762) * reduce steps for bert-base, roberta, fpn models * modify max_iter for fpn models * reduce all img classification models steps * update new config for bert models (#763) * Addin Scipy for TensorFlow serving SSD-MobileNet model (#764) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update TF ResNet50v1.5 inference for SPR (baremetal) (#749) * Added matplotlib dependency to image_segmentation requirements (#768) * Update readmes for the path to output directory (#769) * update wide & deep readme for the path to pretrained model directory (#771) * add a check for ubuntu 22.04 support (#721) * Changes to add bfloat16 support for DIEN training (#679) * Changes to add bfloat16 support for DIEN training * Some for for reporting performance * Fixes for dien training and unit tests * updated tpp file withr2.8 approvals (#773) * Add Windows stock PyTorch support for TransNet v2 (#779) * update TransNet v2 to work with stock pytorch * update Windows.md path in all relevant docs * add P99 metric for LZ models (#780) Co-authored-by: Weizhuo Zhang <[email protected]> * Rn50 training multiple epoches output 1 KPI and add training_steps argument. 
(#775) * enable --training_steps and 1 training KPI output with multiple epoches * add prefix * update print freq * fix display bug * enable PyTorch resnet50 fp16 path (#783) * enable PyTorch resnet50 fp16 path * fix conflict * Extract p99 metric from log to summary (#784) * enable fp16 bert train and inference (#782) * Vruddarr/pt update windows readmes (#778) * remove bfloat16 experimental support note (#786) * Update IPEX installation path (#788) * Clean up _pycache_ files, remove symlinks, and add license headers for dien training bf16 (#787) * update readme for jemalloc and iomp path (#789) * update readme for jemalloc and iomp path * Updated IOMP path as path to the intel-openmp directory * PyTorch: fix resnext101 running script (#795) * Update 3dunet mlperf bash scripts and README (#797) * update 3dunet mlperf doc to use quickstart scripts, rename quickstart scripts for multi-instance * fix tests job (#803) * Adding quick start scripts to ssd-mobilenet bfloat16 precision (#798) * Update T5 model with windows quick start scripts (#790) * Update T5 model with windows quick start scripts * Updated Readme by specifying values to environment variables * Update inference int8 readme and script of 4 CV models using INC (#698) * update docs to add INC int8 models as an option * add instructions for how to quantize a fp32 model using INC * rnnt: fix stft due to PyTorch API change (#811) * rnnt training: fix stft due to PyTorch API change (#813) * Update BareMetalSetup.md (#817) * Gerardod/build container (#807) First phase of GHA WF to build the image of a Model Zoo workload container and push it to CAAS. * Sharvils/tf workload (#808) * TFv2.10 support added. Horovod version updated. 
* Vruddarr/tf add language translation bert fp32 quick start scripts (#804) * Adding quick start scripts to language translation BERT FP32 model * Updated TL notebooks for SPR Launch (#810) * Updates for TL PyTorch notebook * Edits for two more TL notebooks * Reverting previous change for virtualenv * Removed --no-deps and some nonexistent links * Added TFHub cache dir * Updated TL notebook README for legal/branding * Update typo in Readme (#821) * PyTorch: using ipex.optimize for bf16 training (#824) * Fix CVEs for Pillow and notebook packages (#831) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * add intel-alphafold2 optimized w/ IPEX from realm of AIDD (#737) * add alphafold2 from AIDD realm * Remove unused variable in mlperf 3DUnet performance run (#832) * Update Model Zoo name, Python version and message for IPEX (#833) * Update instruction for Miniconda, Jemalloc, PyTorch and IPEX and updt… (#830) * Update instruction for Miniconda, Jemalloc, PyTorch and IPEX and updting the readme by replacing conda with Miniconda. 
* Adding comment to install torch in BareMetalSetup.md * Adding IPEX version and removing *s * Update models main tables (#836) *update main readmes * Adding jemalloc instructions and environment variables (#838) * DLRM hybrid gradient product (#814) * enable hybrid mergedembedding * Hybrid Merge embedding * update the TTT evaluation method by excluding dataloader & metric evaluation (#844) Co-authored-by: Zhang, Liangang <[email protected]> * PyTorch: resnet50 distributed training using lars optimizer (#826) * modify dlrm's sklearn metric eval func to ipex's multi-thread version (#850) * modify recall/precision/f1/ap 's eval as optional (#856) * Port dataloader optimization for distributed training of dlrm (#847) * update the TTT evaluation method by excluding dataloader & metric evaluation * port dataloader optimization for distributed training of dlrm * modify dlrm's sklearn metric eval func to ipex's multi-thread version (#850) * modify recall/precision/f1/ap 's eval as optional (#856) * port dataloader optimization for distributed training of dlrm * delete local bs computation in evaluation stage * modify the TTT output name Co-authored-by: Zhang, Liangang <[email protected]> * Update horovod version to fix run time failure due to Status call (#859) * fix regression for dlrm single node training (#864) Co-authored-by: Weizhuo Zhang <[email protected]> * Update pytorch model zoo table of BF32 with landing zoo models (#865) * Added SNYK scan (#855) * Update SSD-ResNet34 code in start.sh(#862) * Add Distilbert base model for inference (Tensorflow) to model zoo (#815) * Add fp32 inference for distilbert base model * Fix Bert spec file (#873) * 1) Add torch.profiler (#871) 2) change the distributed_training.sh for dlrm to diamond cluster * Update Wide & Deep docs (#875) * The copy of #867(Porting evaluation iteration overlapping) (#876) * port evaluation overlapping * add resnet50 distributed training script (#879) * add resnet50 distributed training script * 
collect TTT Co-authored-by: XiaobingSuper <[email protected]> * reduce redundant bus traffic (#880) * Port all_to_all index overlapping with interaction and top mlp. (#878) * port all_to_all index overlapping with interaction and top mlp * fix seg fault * Add int8 support for distilbert (#823) * Add fp32 inference for distilbert base model Co-authored-by: syedshahbaaz <[email protected]> * Update DIEN inference docs & quickstart scripts (#869) * Update DIEN docs * update for spr ww42 Co-authored-by: WafaaT <[email protected]> * Update ResNet50v1.5 docs (#820) * Update and Validate ResNet50v1.5 Inference and training model for TF SPR * Update and validate docs for TF SPR Co-authored-by: WafaaT <[email protected]> * Update Wide & Deep using Large Dataset docs (#877) * Vruddarr/tf bfloat32 precision check (#893) * Update Wide and Deep Large Dataset Training Model docs (#881) * Vruddarr/tf update image recognition models docs (#816) * Update Inceptionv3,DenseNet 169, Inceptionv4, ResNet50, ResNet101, MobileNet V1 quickstart scripts and docs * Update and validate MobileNet v1 for TF SPR Co-authored-by: WafaaT <[email protected]> * Fix BFloat32 precision check code for Resnet50v1.5 training model (#894) * Update 3DUNet MLperf for SPR (#889) * Updated Bert Large SPR READMEs (#887) * Fix typos in MobilenetV1 scripts (#899) * modify time function to solve int8 benchmark issue on windows (#898) * modify time function to solve int8 benchmark issue on windows * Replace the time.time function calls to time.perf_counter to improve the time statistic resolution. 
Updated for the additional 5 models Co-authored-by: Ying <[email protected]> * Update DIEN Training docs (#882) * Adding permissions to scripts in DIEN and correcting pb file paths in README_SPR_baremetal (#901) * Adding SPR_baremetal_readme and fixing model paths in the tables (#904) * fix acc test for single node (#903) * fix acc test for single node * Update dlrm_s_pytorch.py Co-authored-by: Weizhuo Zhang <[email protected]> * commit cherry-picks from r2.9 (#900) * update tbb files (#843) * fix vulnerability issues reported by snyk scans (#848) * upgrade for ipex 1.13 * Update Pillow to '>=9.3.0' (#884) Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * fix some bugs for p99 (#909) * Update tensorflow benchmarks to use latest horovod commit (#908) * Update start.sh * Update start.sh * Update to use shortened commit hash * do not convert data to bf16 while using fp32 and bf32 (#911) Co-authored-by: Weizhuo Zhang <[email protected]> * Update SSD-Resnet34 training docs for SPR task (#914) * Update SSD-Resnet34 training & docs for SPR * Vruddarr/tf update ssd mobilenet docs (#846) * Update quick start scripts and spec file to run for all precisions * Update and validate SSD-Mobilenet docs for TF SPR Co-authored-by: WafaaT <[email protected]> * fix print issue (#915) Co-authored-by: Weizhuo Zhang <[email protected]> * Update rfcn docs to use same quick start scripts (#897) * Update rfcn docs to use same quick start scripts Co-authored-by: WafaaT <[email protected]> * Sharvils/spr ssd training (#917) * Dockerfile updated * Update SSD-ResNet34 Inference docs (#866) * Update ResNet34 Inference to use same scripts & docs for all precisions * Update for SPR WW42 Co-authored-by: WafaaT <[email protected]> * Update transformer_mlperf scripts and README fro SPR WW42 (#891) Co-authored-by: Wafaa Taie <[email protected]> * Update TF models spec files for SPR WW42 (#919) * update TF models spec files for spr ww42 * update 
docker partial for tf addons version * workaround rdma config for spr (#925) * remove supported OS checks (#926) * Update Model paths in main readme (#928) * Remove Linux/windows OS platform support checks (#927) * update resnet50 distributed training script (#923) * resnet50 distributed training: use logical core for ccl (#930) * Update bert scripts to add same quick start scripts to all precisions (#910) * Update MobilenetV1 SPR docs (#931) * Update Resnet50v1_5_SPR_docs (#934) * Update SSD-Mobilenet SPR docs (#935) * Update Resenet50v1.5 inference SPR docs (#933) * Fix DIEN inference.sh script and add pretrained model env var in mobilenetv1 SPR baremetal readme (#939) * Update DIEN Inference and Training SPR docs (#937) * Update SSD-Resnet34 training SPR docs (#936) * Update SSD-Resnet34 Inference SPR docs (#938) * Update README_SPR_baremetal.md remove steps and warm_up steps env vars Co-authored-by: Wafaa Taie <[email protected]> * BERT training dockerfile fixed (#921) * BERT repo version fixed for SPR container (#920) * Update spr baremetal instructions for 3dunet, bert large and transformer mlperf (#932) * Update Transformer MLPerf inference docs for pre-trained models (#940) * Fix Language Translation BERT quickstart scripts (#941) * fix scripts to detect the number of cores * Update mlperf_gnmt docs (#945) * Updating Transformer_LT_official scripts (#913) * Update main README.md (#947) * update main readme * edit transformer_mlperf and bert SPR docs * Fix CVEs based on Snyk scans in TL notebooks (#951) * fix snyk critical issues in TL jupyter notebooks * Remove INC dependency for Snyk issues (#953) (#954) * removed neuralcompressorfor to avoid vulnerability in Snyk scans * update spec files for pretrained models links (#957) * Fixed num_intra_threads for bfloat16 (#959) * Fixed num_intra_threads for bfloat16 * Modified open mpi instructions * Added kmp_blocktime for bfloat16 * Fix syntax error and pythonpath in ssd-resnet34 training (#962) * fix training 
bkms (#967) * fix T5 inference script (#969) * Fix for SSDRN34 training failure (#970) * fix for ssdrn34 nightly failure * Molly/rdma dist (#972) * revert commit 925, enable RDMA CONFIG * revert pr 925, enable rdma config * Update Serving Docs Versions (#974) * Update Versions * Add TODO * Removed precision folders and updated quickstart scripts (#922) * Removed precision folders and updated quickstart scripts * Updated README and changed script names * generated README and advanced.md * TF SPR DevCatalog READMEs (#983) * add image recognition devcats * add tf object detection devcats * add TF language translation devcats * add tf image segmentation devcats * add tf language modeling devcats * add recommendation tf devcats * fix swapped containers and precision in run command * add README_SPR to all getting started links and correct script names * rename files and point getting started to itself * fix last link * Fix spec files (#989) * fix spec file * add docker.md doc snippet * Fix ssd-mobilenet inference script (#990) remove throughput aggregation * Update Trasformer_MLPerf Inference docs to use same quick start scripts (#963) * Update benchmark readmes and fix inference.sh file --------- Co-authored-by: WafaaT <[email protected]> * Corrected typo in SPR quickstart scripts (#991) * Update Transformer_MLPerf Training docs (#973) * Update Transformer_MLPerf Training docs * changes for code review comments * add workload container readme Co-authored-by: WafaaT <[email protected]> * fix minor error (#994) * Update TF SPR ww42 containers partials, spec-files and dockerfiles (#998) * Sharvils/tf devcats fixes (#995) Minor fixes to SPR TF DevCatalogs --------- Co-authored-by: sharvil.shah * SPR PyTorch DevCatalogs (#993) Added Devcatalog files targeting SPR container launch * Delete SPR containers README_SPR.md (#999) * delete README_SPR.md * remove references in spec-files * fix numpy 1.24 deprecated np.float issue for MaskRCNN pytorch (#1006) * enable fp16 for 
distilbert (#1005) * Add pytorch and tensorflow devcatalog tables (#1008) * add table of devcatalogs * add devcatalog tables * make title changes * move files to docs folder * Fix ssd-resnet34 workloads, which are currently failing in TF-CPU nightly testing (#1013) * ssd-resnet34 training: import register_tensor_conversion_function from tensorflow.python.framework.tensor_conversion_registry, which is the current proper library * ssd-resnet34: remove horovod requirement which is preventing workload from running in TFDO nightly testing due to too-old horovod version * ssd-resnet34 training: apply register_tensor_conversion_function to bfloat16 * Update ssd-resnet34 README files to suggest horovod>=0.27.0 for training and removing horovod for inference * Liangan1/tpp bert (#1016) * Add SQuAD script for inference/training with TPP optimization * Add pretrain scripts for TPP optimization with (fast_bert API) * Align dataset/model path for fast_bert script * Update README.md * Update README.md * Update README.md * Update README.md * Fix train ENVs * Update fast_bert_pretrain.sh * Update run_pretrain_mlperf.py * Add pretrain scripts with 8 nodes * Refine scripts * modify rn50 distributed training script (#1017) * Fix ssd-resnet34 inference failure due to register_tensor_conversion_function moving from ops to tensor_conversion_registry (#1014) * Adjust CODEOWNERS * fix AG Ramesh CODEOWNER * Remove AG Ramesh. Unknown user? 
* modify rn50 distributed training script (#1028) * upgrade ipython to 8.10.0 to avoid vulnerability (#1024) * Fix weightsharing scripts for resnet50 v1.5 and bert large (#1027) * fix numa cores lists for weightsharing instances for both resnet50 and bert large * add --localalloc * correct links and reduce table columns (#1011) * correct links and reduce table columns * correct segmentation table * correct more tables * change dataset links and description * remove local path reference * remove relative links * changed links to point correctly * Dataset API (#1032) * add a new api to download and do minimal preprocessing if supported * add support for brca, tabformer and dureader datasets * add preprocessing support for brca dataset * update inference script to use a single socket (#1007) * set bf32 flag as env var (#1009) * Update README for fp16 ENV (#1036) * Molly/fast bert (#1037) * revert commit 925, enable RDMA CONFIG * bugfix: Args type bug fix * pytorch maskrcnn dev catalog (#977) * pytorch mask rcnn dev catalog * adding links * Update README_DEV_CAT.md Remove Asian font comma, add all precisions to export command --------- Co-authored-by: Clayne Robison <[email protected]> * add support for msmarco dataset download (#1034) * pytorch: update resnet50 readme of fp16 path (#1038) * increase mnasnet_0_5 iterations for latency mode to 5000 (#1041) * update windows.md (#1042) * Fix TF BERT large weight sharing QSS script (#1043) * fix script name to match readme, and remove printed cores lists * clean up old log files * test_multiple_jobs * delete file * test_multiple_jobs * added manual play * updated file extension * added test_multiple_jobs label * I O optimization for evaluation (#896) * Updated runs-on * Dataset API: Update DuReader dataset name and raw dataset links (#1046) * update the dureadr dataset name and raw dataset links * update readme for the dataset download command * Update main README.md language based on feedback from legal (#1051) * update 
the language of the main readme based on legal feeback * changes for code review comments * Update README.md --------- Co-authored-by: Clayne Robison <[email protected]> * Revert "I O optimization for evaluation (#896)" (#1061) This reverts commit 483c45020010cd8d947ff30c7f1b970d4972c003. * Fix for Horovod issue #3861 (#1071) Signed-off-by: Abolfazl Shahbazi <[email protected]> * add warmup for roberta and bert_base (#1073) Co-authored-by: diwei sun <[email protected]> * Added feature to include terms and conditions (#1062) * Added feature to include terms and conditions * Modified terms and conditions to be accepted once only * Modified terms and conditions text * Added new line at EOL * Modified dependency file name * Modified scripts to run as per user acceptance on terms and conditions * Fixed URL formatting in TnC file * Added conditions based on Terms and condition * Modified the names of the variables * Update datasets/dataset_api/terms_and_conditions.txt Co-authored-by: Mahathi Salopanthula <[email protected]> Co-authored-by: Clayne Robison <[email protected]> * Code cleanup --------- Co-authored-by: Mahathi Salopanthula <[email protected]> Co-authored-by: Clayne Robison <[email protected]> * Adding inter/intra op threads config to W-n-D data layer (#1076) * Enable Vision Transformer (#992) * fp32 precision * Enable TF BERT-large SQuAD FP16 inference (#1070) * enable bert_large float16 inference through launch_benchmark.py Co-Authored-By: Bhavani Subramanian <[email protected]> * add support for both AMP and keras MP * adding support for quickstart scripts and minor lint-related changes * adding float16 support in quickstart README * renaming float16 to fp16 * updating README about the new flag for enabling grappler AMP * updating model README --------- Co-authored-by: Bhavani Subramanian <[email protected]> * Enable TF ResNet50v1.5 FP16 inference (#1065) * enable rn50 float16 inference through launch_benchmark.py Co-Authored-By: Bhavani Subramanian <[email 
protected]> * reverting training-related changes * minor change - copyright year in new files * renaming float16 to fp16 and adding support for quickstart scripts * minor correction of renaming Float16 to FP16 * renaming float16 to fp16 changes * updating model README to indicate the use of FP16 --------- Co-authored-by: Bhavani Subramanian <[email protected]> * Bfloat16 support for TF-ViT model (#1077) * Add Bfloat16 support for vision transformer model * Update AMP optimizer name * Add bf16 tests * Change to multi-instance scripts for CI * Update readme and accuracy script * Address review comments * Fix test * Update main README * Enabling float16 training for ResNet50v1_5 (#1079) * Updated quickstart README files * Added model init files for ResNet50v1_5 fp16 training * Updated start.sh with fp16 precision * Enabling fp16 in main model scripts * Added fp16 precision to BaseBenchmarkUtil class * Final changes to enable model * Updated License Headers * Added a unit test for RN50 FP16 training * Updated 2 main Readme files * Enabling float16 training for BERT large / SQuADv1.1 (#1018) * Adding files to enable fp16 Bert Large training * Resolved float16 scope name conflict * Adding changes to support Loss scale optimization for the custom AdamWeightDecay Optimizer * Added a flag to switch between AMP and KMP * Changing default precision to float16 * Added changes to start.sh to support BERT large float16 training * Changing name from float16 to fp16 * Adding model_init and Readme for BERT large Float16 training * Shortening line length for unit test * Removing trailing whitespaces * Adding option for fp16 in BaseBenchmarkUtil class * Adding flag to switch between AMP and KMP weight updates * Updating TF requirement and Copyright year in all files * Corrected grammatical error in precision description and added new comments. 
* Updating FP16 convergence results in the README * Removing unnecessary files * Updating license headers for all the modified files * Added fp16 datatype to the quickstart bash script * Updated License year * Final quickstart script for Bert Large Squad training * Updated README for the new quickstart script * Removed verbose log flag * Updated Intel License Header to Readme * Updated 3 main Readme files * Added --amp details to Readme * Renamed squad.sh to training_squad.sh * Updated README files to include Squad Training use-case * FP16 support for TF-ViT model (#1081) * Add FP16 support * Fix test * Enable FP16 support for Distilbert Inference (Tensorflow) (#1075) * Enable FP16 for distilbert model * Add config and model_init file for distilbert fp16 * Update README; Add Unit Test for fp16 * Add quick start scripts for distilbert accuracy, latency and throughput * Add extra line at the end of scripts * Remove a unit test * Add accuracy and benchmark unit test for fp16 * Update quickstart scripts and their file permissions * Update --num-intra-threads arg in quickstart scripts; Update README * Update README.md * Update command for quickstart script in README * Gda/pr poc (#1069) * Added tests and changed logic to have precisions as an array on json file. * Fix OOM caused by incorrect thread setting. 
(#1084) * Ejan/model zoo quickstart (#1082) * Fix syntax for resnet50v1.5 inference * Import GPU Max and Flex Series workloads from develop-gpu (#1080) * Add GPU DLRM FP16 inference * Change to install ATS drivers from local repo * Add GPU PYT bert large FP16 Inference * fix _FusedMatmlul issue in GPU * Updated PyTorch to use the common compiler partial and added ARG for the env var file since that changes per compiler * Add package for ResNet 50 v1.5 int8 Inference pytorch gpu * Update specs & build files for alpha2 rc1 whls * Add ResNet50 v1.5 bf16 Training PYT GPU * Add wrapper package for TF GPU tool container * Update TF GPU training packages to use alpha2-rc1 * Update IPEX tools container and resnet50v1.5 models for alpha2 rc1 * Update PYT Bert LG and DLRM FP16 inference alpha2-rc1 * Update tf-gpu branch for ww15 dpcpp compiler * Set ITEX_ENABLE_ONEDNN_LAYOUT_OPT=0 for bert training * Add section to validate base container, fix dlrm printed statement * Update the docs for alpha2-rc2 models * fix ipex tool container readme * Fix dlrm print using CPU statement to be XPU * add 1t env vars * Use add instead of addn * Update bert large docs to be specific about which pretrained model to use * Sync with develop Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update the main benchmarks README for gpu models * Set ITEX_ENABLE_ONEDNN_LAYOUT_OPT=0 in ResNet50v1.5 bf16 training quickstart scripts * Revert "tmp fix res50v1_5 int8" This reverts commit 3c120e0bee3a576ee1548d9258b611a889897ee6 * Updates to match batch sizes in docs and updated pb links * Updating compilar binary * Update PYT GPU packages for IPEX alpha2 rc6 * rfcn-fp32-inference-k8s package Signed-off-by: Kam D Kasravi <[email protected]> * Update GPU specs to make the docs section a list and update TF training docs for DevCloud * Doc updates for ResNet50v1.5 and BERT large training for GPU * tf-gpu doc updates * Fix the BKC and environment for resnet50v1.5 INT8, bert-larget and resenet50v1.5 BF16 
training * Update GPU PYT packages to have 2 READMEs * Remove duplicate license from package * AI Kit Model Package README * Clean up PYT model pkgs and update baremetal docs * Fix GPU tests (#5) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Sync with 'develop' and resolve conflicts (#3) * Update README.md for IPS 00513014 and 00514541 * Enable remapper pass in densenet169 execution * Adds protoc and pycocotools dependencies * K8s packages tests: Checks if username has underscore before creating a namespace * Fix and simplify serving k8s package path variables * Upgrade to 'TensorFlow Serving 2.4.0' Signed-off-by: Abolfazl Shahbazi <[email protected]> * rfcn-fp32-inference-k8s package Signed-off-by: Kam D Kasravi <[email protected]> * Quickstart updates for using synthetic data or real data, except SSD-ResNet batch will always use synthetic * Add Centos8 partials for SPR TF models * Fix the URL for 'oneAPI-samples' repo * snapshot Signed-off-by: Kam D Kasravi <[email protected]> * Add a copy of existing pytorch ipex icx centos specs to specs/centos * Fix High vulnaribility issues reported by SNYK Signed-off-by: Abolfazl Shahbazi <[email protected]> * Setting OMP_NUM_THREADS based on num_intra_threads * Weekly SNYK fixes Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fixes broken links in the Launch Benchmarks documentation * Fix '3d-unet' docker image links Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix Python and TensorFlow Pip package versions for TF v1.15.2 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Adding a minor fix to dynamically calculate the number of remaining images to be steps provided x batch size. Currently the max number of steps the RN50 inference supports is max of 5000 / batch size.. The 50k hard limit is not letting us to perform long inference runs for platform analysis. Hence requesting this fix. This will enable us to collect telemetric data (like emon) to be collected for longer duration (like 5 mins). 
Signed-off-by: Rajendrakumar Chinnaiyan <[email protected]> * Remove unused 'num_cores' from 'rfcn' Signed-off-by: Abolfazl Shahbazi <[email protected]> * Upgrade to 'Pillow>=8.1.2' Signed-off-by: Abolfazl Shahbazi <[email protected]> * Compatibility fixes for automation * Parameterized model name in resnet50v1.5 serving script * Increase timeout and modify output * Adjusts inceptionv3 client input and output * fix mpi operator cluster scope issue * Fixes SSD-MobileNet perf comparison by pre-installing numpy with --no-binary * Enable more models for Perf Analysis notebooks and add auto testing for notebooks * Update quickstart bare metal documentation to use ./quickstart/<script>.sh * Fix lints tests for rfcn Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add support for SSD-ResNet34 BF16 inference * Updating benchmarks table with 'SSD ResNet34 BFloat16' Signed-off-by: Abolfazl Shahbazi <[email protected]> * modifying requirements.txt in SSDRN34 to use tensorflow add-ons of any version greater than or equal to 0.11.0 * Moving quickstart files to their proper directories and bats test fix * Update specs and assembler.py to make the documentation section a list * Fix error in BF16 accuracy test for SSD-ResNet34 with input size of 1200 * Shwetaoj/horovod version * Fix pip install commands for Python3 and 'numpy' version Signed-off-by: Abolfazl Shahbazi <[email protected]> * Updated README file for transformer_mlperf model, fixed of link of sections and added the instructions to run transformer model for both fp32 and bfloat16 inference * Update BERT large docs for to separate out "advanced" and allow for using quickstart scripts when cloning the repo * Adding DIEN model to modelzoo for inference (fp32 and bfloat16) * Fixing data format issue for SSD_RN34 and Resnet50 training models * Replaced existing mlperf transformer LT bfloat16 training model with a converged model, multi-node support is kept * Fix for accuracy flag * Fix some styles for recently merged 
'DIEN' model Signed-off-by: Abolfazl Shahbazi <[email protected]> * Set 'OMP_NUM_THREADS' to 'num_intra_threads' Signed-off-by: Abolfazl Shahbazi <[email protected]> * Updated the transformer_mlperf README file, and also restore a change by accident * Fix styles and other cleanup Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update BERT Large docs to put AI kit first * Added support for frozen graph with bfloat16 precision. * Update README file and fix few errors. * Fixes for 3D-Unet Mlperf * Fix link to 'g3doc' installation Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update docs for DenseNet 169 and Faster RCNN FP32 inference * Fix `environment` spelling typo * Fix for ssd-resnet34 inference * Stock PyTorch vs Intel's optimization comparison notebook * Doc updates for AI Kit * Adding fix to ssd-resnet34 bfloat16 training * Doc updates for recommendation models for AI Kit * AI Kit doc updates for Faster RCNN * Update SSD ResNet34 backbone model links Signed-off-by: Abolfazl Shahbazi <[email protected]> * 3D U-Net AI Kit doc updates * Mask RCNN AI Kit doc updates * Doc updates for language modeling models for AI Kit * UNet doc changes for AI Kit * Fixed a bug in mlperf_transformer model real time performance measurement, which was caused by the batch size was fixed in the model. Also with some code cleaning up * Doc updates for RFCN for AI Kit * Update the docs/README.md to add a AI Kit doc link * Removing $ from shell command snippets * Doc updates for SSD-MobileNet for AI Kit * Update DenseNet169 doc to use the tensorflow conda env for AI Kit * IMZ CentOS Support for start.sh * Doc updates for WaveNet for AI Kit * Doc updates for InceptionV4 for AI Kit * WORKAROUND - Update horovod version to a commit on master branch to fix build error in horovod * rama/3d unet * Enabled user specified warmup and benchmark steps. 
* Merge branch 'dtran/platform_util_add' into 'develop' Added functions to expose some of the properties like core, logical core, numa nodes See merge request intelai/models!495 * update all TF images to latest Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update TF TPP link too Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add document for users who are new to docker * Update InceptionV3 docs for AI Kit * Update code to write checkpoint files to the --checkpoint dir, even when the backbone model isn't provided * Fixing the link target to the README section that lists the model's prerequisites * Update MobileNet V1 docs for AI Kit * Update ResNet50 & ResNet101 docs for AI Kit * Regenerate docs too for SSD ResNet34 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix SSD ResNet34 style and unittests Signed-off-by: Abolfazl Shahbazi <[email protected]> * Doc updates for language translation models for AI Kit * Fix typo in "advanced" setup section * Doc updates for ResNet50v1.5 for AI Kit * In-graph arg should be omitted if None for BERT BF16 inference * Changes to add num_iterations option for DIEN model * DIEN script refactoring + static graph flag + bf16 online pass support * Check for 'NOINSTALL' before running 'YUM' commands Signed-off-by: Abolfazl Shahbazi <[email protected]> * Initial commit for SSD-RN34 BF16 inference * Prepare for Model Zoo v2.4.0 release Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update output based on new graph * Sync with 'develop' and resolve conflicts * Regen documentation and dockerfiles * Update 'OWNERS' file (#4) * Update 'OWNERS' file Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add more owners Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix one last failing test * Update 'DIEN' readme (#6) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Prevent adding wheels or other archives to the repo (#7) Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: ltsai1 
<[email protected]> Co-authored-by: Yimei Sun <[email protected]> Co-authored-by: Melanie H Buehler <[email protected]> Co-authored-by: Taie, Wafaa S <[email protected]> Co-authored-by: Kasravi, Kam D <[email protected]> Co-authored-by: Mahmoud Abuzaina <[email protected]> Co-authored-by: Jones, Dina S <[email protected]> Co-authored-by: Rajendrakumar Chinnaiyan <[email protected]> Co-authored-by: Yerneni, Venkata P <[email protected]> Co-authored-by: Thakkar, Om <[email protected]> Co-authored-by: Ojha, Shweta <[email protected]> Co-authored-by: Cui, Xiaoming <[email protected]> Co-authored-by: Varghese, Jojimon <[email protected]> Co-authored-by: xiaoming <xkdjfk> Co-authored-by: Khanna, Kanvi <[email protected]> Co-authored-by: mdfaijul <[email protected]> Co-authored-by: Shiddibhavi, Sharada <[email protected]> Co-authored-by: Shah, Sharvil <[email protected]> Co-authored-by: Ketineni, Rama <[email protected]> * GPU RN50v15 Inference (#11) Generate package with support for all precision Co-authored-by: Dina Suehiro Jones <[email protected]> * Adds the PyTorch GPU BERT inference package (#10) * Add PyTorch GPU BERT inference container * documentation updates * Make scripts executable * Removing these vars until we hear from mingxiao * Make brackets consistant * Formattting * fix output dir * Removing typo * Add note that says the first run will download the pretrained model * Fix which README goes in the package * Updates based on the latest bkc * Update quickstart file names in the spec * Use tee * Adds the PyTorch GPU BERT training package (#13) * Add documentation, quickstarts, and spec for PyTorch BERT training for GPU * Fix which README goes in the package * Add glue files * Updates based on the latest BKCs * Doc update and log to screen * Doc update * add support for bfloat16 (#17) * Adds the PyTorch GPU ResNet50v1.5 training package (#15) * Add ResNet50v1.5 PyTorch training model package * Update files in package * update file list for main.py * Spec 
update * Fixes after review * Use tee * Adds the PyTorch GPU ResNet50v1.5 inference package (#14) * Add docs, quickstart scripts, and spec for PyTorch ResNet50v1.5 for GPU * Fix file path * Doc and BKC updates * Update permissions * Moving the PyTorch DLRM GPU model code (#18) * Moving the dlrm code out of the precision folder, since it's the same for all precisions * Updates from the latest gpu-models 0.2.0gpu_rc1 branch * Update the old DLRM spec, due to moving the code * Moving DLRM inference/gpu code to be common gpu code used for both inference and training * Fixing models paths from the old spec/quickstart * Add GPU RN50v1.5 Training (#19) * added spec & generating package * removed existing folder * fixed docxumentation * fix scripts * review changes * Adds the PyTorch GPU DLRM training package (#21) * Add initial spec and docs for DLRM pytorch gpu training * Updated docs * Update permissions on quickstart * update dataset doc * Fix file path * Adds the PyTorch GPU DLRM inference package (#22) * Moving the dlrm code out of the precision folder, since it's the same for all precisions * Updates from the latest gpu-models 0.2.0gpu_rc1 branch * Update the old DLRM spec, due to moving the code * Moving DLRM inference/gpu code to be common gpu code used for both inference and training * Fixing models paths from the old spec/quickstart * Add files for the spec and documentation for DLRM pytorch GPU inference * Update quickstart file list * Documentation updates * removing old code * Update dataset instructions * Add log file analysis * Update to add download of the model weights * Doc updates * Update the datasets instructions to note that the first time the model is run, the preprocessing happens * Add init files in language modeling & tensorflow folders (#23) (#24) * add init files in language modeling & tensorflow folders * changed year Co-authored-by: Jitendra Patil <[email protected]> * Add GPU Bert Large inference (#25) * added scripts * update docs * updated 
scripts * removed unnecessary folders * removed spec file * fix import issues * review changes * review update 2 * Add GPU Bert Large training (#29) * initial commit * update spec file * added missing init file * update docs * deleted unwated files * review changes * gpu support for bfloat16 (#31) * updates to docs & scripts (#34) * Update pytorch bert for gpu to include transformers code (#32) * Update pytorch bert for gpu to include transformers code * BERT large inference doc updates and fixes * update to use a clone of the AI Kit conda env * Fix paths in the quickstart script * add sacremoses to the requirements * Updates to the DLRM model packages for PyTorch GPU (#35) * DLRM fixes * update training precisions * Updates for the user to download the pretrained model separately * PyTorch GPU ResNet50v1.5 updates and fixes (#38) * Add resnet models file * Fix to use tee * Updates for training * Fix log file name * Whitespace * whitespace * Updated model files from gpu-models 0.2.0gpu (b761567) * Quickstart updates * PyTorch model source from the gpu-models 0.2.0gpu branch (b761567) & add env vars (#39) * Updated models from the gpu-models 0.2.0gpu branch (b761567) * Updated BKCs * Add setting of env vars * Change warn to echo * added env parameter (#44) * Fixes for PyTorch GPU AI Kit models dependency install (#45) * PyTorch GPU fixes from SH for running from a read only directory (#46) * Updates from the PyTorch team * Set tensorboard logdir * Updates from Agnieszka's fixes_0.2.0gpu branch * Grabbing unchanged files * Reverting header year change from unchanged file * Removing old pytorch gpu model spec/dockerfiles (#50) * Revert "Removing old pytorch gpu model spec/dockerfiles (#50)" (#51) This reverts commit d716b915cf20efc437467930e48a2f829a898f55. 
* Removing old PyTorch GPU dockerfiles/specs that are for a specific precision (#52) * Updates for the PyTorch IPEX GPU base container package (#53) * Updates for base pytorch gpu container * Update dockerfile name * Update name of the agama sources file * Update docker image names in the doc * Switch back to intel-graphics-local.list * Doc update * Update env vars * update to use l_dpcpp-cpp-compiler_p_2021.3.0.3168_offline.sh * Update to use l_dpcpp-cpp-compiler_p_2021.3.0.3168_offline.sh * README updates * Doc update to make title match what users will see in IRC * Adds inference and training container packages for PyTorch BERT large for GPU (#57) * Add workload containers for PyTorch IPEX BERT large inference & training for GPU * Update to clarify base build and update build script to check for the base * Fixing package name * update dockerfile to use latest mkl * Make base image vars * Update run.sh to use --group-add * Regenerate dockerfiles * Syntax fix * Add Tensorflow base container (#60) * update specs * first working version * updated build * updated docs, build & spec * doc update * tabs -> spaces * Adds inference and training container package for PyTorch DLRM for GPU (#58) * rename specs * add initial files * Updated docs and add build.sh and run.sh * Fix dockerfile name * Regenerate dockerfile * update pretrained model path * Add new line at the end of build.sh files * Adds inference and training container packages for PyTorch ResNet50v1.5 for GPU (#61) * renaming specs * Generate dockerfiles * Add documentation for the wrapper package * Add build and run scripts and update spec for the wrapper package * Use makedirs to create leaf folders * syntax * syntax * Fixing broken links * Removing --do_eval for bert large training (#67) * Add Bert Large inference GPU package (#69) * initial version * updated docs * rename spec & package name * review changes * fix broken link * Add Bert Large training GPU container package (#72) * initial version * updated 
build & run scripts * update docs * Add ResNet50v1.5 GPU container packages (#75) * initial commit * added docs in spec * wrapper package generation * update docs * add training * updated docs * parameterize docker args (#76) * Update run.sh with docker args for PyTorch GPU container packages (#77) * update docs (#78) * RN50 training bug fix (#80) * bug fix * update batch size * GPU Bert training container package fix (#84) * initial working version * added env parm * dummy data generation integratex * more update * env fix * updated docs * Adding NDA TPP file (#87) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update PyTorch IPEX gpu wheel name (#88) * Update PyTorch IPEX gpu wheel name * Fix files in spec * Add pre-trained models for gpu container packages (#123) * added pretrained models to pacakge * update doc * add pretrained models for rn50 * volume mount fix * review changes * Update pytorch for new wheel names (#128) * Update PyTorch code from gpu-models b05e2161 (#129) * Update pytorch for new wheel names * Updated resnet50 images * Fix formatting * fix formatting in training file * Update copyright year * Updates for BERT from gpu-models * Updated dlrm code file from gpu-models * Add do eval for bert training * Updated scripts and env var * Remove UseVmBind and add EnableDirectSubmission=1 in setvars.sh * Tensorflow - 2021.3.1 NDA release (#131) * update itex binary name * itex file name fix * BKC changes * debug changes * increasing shm mem * updates * updated batch size * loggin more frequently * rolling back some changes * Add copyright to python and bash scripts files (#147) (#149) * add copyright to files * one more file (cherry picked from commit 1635120e5c7b7e6d7fff44e666533d33a47a6445) * compilre version change (#155) * PyTorch PVC updates (#188) * Add dockerfile with PVC env vars * PVC dockerfile updates * Removing pvc specific dockerfile * Renaming ATS vars to PVC * Updated gpu-models code (07854e5d09cc7f380355f8ca50ebe8bc9c09bf22) * 
BERT large inference and training quickstart updates * Update BERT train long analysis function parameters to add batch size * DLRM updates * Doc updates for the DLRM terabyte dataset * README updates * Update 'Ats' in message * Fix ENV in partial * Fix typo * Fix typo * Another typo :( * Revert "PyTorch PVC updates (#188)" (#194) This reverts commit 99f569ca09d6a7b333959d14fb5ce29df4e08077. * PyTorch updates for PVC pre-alpha (0.2.2) (#195) * Add dockerfile with PVC env vars * PVC dockerfile updates * Removing pvc specific dockerfile * Renaming ATS vars to PVC * Updated gpu-models code (07854e5d09cc7f380355f8ca50ebe8bc9c09bf22) * BERT large inference and training quickstart updates * Update BERT train long analysis function parameters to add batch size * DLRM updates * Doc updates for the DLRM terabyte dataset * README updates * Update 'Ats' in message * Fix ENV in partial * Fix typo * Fix typo * Another typo :( * Update Torch CCL install * Update list of quickstart scripts in the DLRM inference spec * Updated weights file for DLRM inference * Fix <package name> text replacement * update compiler and oneMKL * Update the base container README due to ipex import changes and remove --privileged * Update base container README based on review feedback * update DATASET_DIR for DLRM to remove 'day' * The DLRM dataset /day paths were correct - putting them back in) * Updates for main_int8.py * Add one CCL * Add build arg for CCL * Add l_oneapi_ccl_p_2021.4.0.423_offline.sh to the package * Switch to use base kit as an experiment * Update dockerfile for basekit * Make a separate spec for basekit for debug * Torch CCL from source * make torch-ccl directory relative * Fixing path in spec * Fix package path for torch_ccl * Go back to using wheels for Torch CCL * Fix dockerfile name for basekit build.sh * fix typo * Removing ENVs that were already defined * Make sure ONEAPI_ROOT is getting set * update image tag * update to use new wheels * pip updates to prevent dependency 
version warnings * PVC alpha release - Tensorflow (#201) * base container update * udpated env vars * updated models * added oneccl * build script update * fixed ccl installation * update bkc * training bkc update * fixed bf16 * remove horovod whl install * merge related fixes * code review changes * Add PyTorch PVC container package for SSD-ResNet34 Training (#269) * Add PyTorch PVC SSD-ResNet34 training spec, partial, docs, and code files * Add git * Moving partial to the ubuntu folder * regenerate dockerfile with git install * Reorder paritals * Add python3.6-dev * Add python3.7-dev * Removing precision as a requirement * Fix path * Make training.sh executable * Add env var * Update docs and add block/plain format * fix filenames in spec * Removing dockerfile that's not used * Add info on the known issue for plain format * Add note about the original repo * TF PVC 3D-UNet and MASK R-CNN containers. (#271) * initial working package & build * update build & run scripts * maskrcnn pkg generation * scripts updates * doc update * more doc update * doc updates * docs update + * 3d-unet working with basekit * scripts updates * basekit based models * fix docs * mixed precision script * update scripts * docs update * review changes * review changes 2 * Mask RCNN pre-alpha container (#276) * changes based on feedback from model owner * fix typo * fix path * fix docs links (#326) * Add model package for PyTorch SSD-ResNet34 inference for ATS-P (#339) * Updates to add ssd-resnset34 inference * update models path * Update quickstart paths * Doc and setup script updates * Add install setuptools * Doc update and model script updates * Write dllogger to a different dir * Update dllogger dir * add models folder and update to use dllogger from pip * Update doc * Add new setvars for ATS-P * No deps for torchvision install * Doc updates * Updated gpu-models code * Removing container related files since those aren't tested yet * Adding back note about original repo * putting back 
PVC setvars.sh * Removing JIRA links * Update PyTorch ResNet50v1.5 inference and training for AI Kit 2022.1 GPU NDA (#344) * PyTorch ResNet50v1.5 updates for AI Kit 2022.1 * Update versions in main spec * Updates years in header * Update PyTorch SSD-ResNet34 training for AI Kit 2022.1 GPU NDA GPU (#345) * Updates for SSD-ResNet34 training for AI Kit 2022.1 * Removing JIRA links * Update to note that the same conda env is used for both inference and training * tf gpu 3d-unet (#347) * Add 'Deep Learning Examples for Tensor Cores' to '3d-unet' model for TF Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix both single tile and multi tiles patches for 'UNet_3D_Medical' Signed-off-by: Abolfazl Shahbazi <[email protected]> * Pre-apply the single tile patch to 'UNet_3D_Medical' Signed-off-by: Abolfazl Shahbazi <[email protected]> * update the docs and spec for 3d-unet GPU Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update doc per review Signed-off-by: Abolfazl Shahbazi <[email protected]> * Regerate docs for 3d-unet' Signed-off-by: Abolfazl Shahbazi <[email protected]> * Addin 'Intel' header to modified files Signed-off-by: Abolfazl Shahbazi <[email protected]> * Regen docs and remove checkpoints reference for 3d-unet Signed-off-by: Abolfazl Shahbazi <[email protected]> * PyTorch GPU SSD-ResNet34 fixes (#350) * Regenerate dockerfiles * Don't have a dockerfile for this one yet * PyTorch DLRM updates for AI Kit 2022.1 (#349) * PyTorch DLRM updates for AI Kit 2022.1 * Update quickstarts * Updates the PyTorch GPU BERT inference and training model packages for AI Kit 2022.1 (#348) * BERT updates for PyTorch GPU * Update doc to note sourcing setvars.sh * add setup.sh to the specs * Fix path * Add inference models README * Update to add README for bert training * Adding rust * Update setup script for training * Updates from https://github.com/intel-innersource/frameworks.ai.pytorch.gpu-models/pull/129 * Update setup * add data folder * Fix path * Removing 
transformers * update dependencies * Fix pip install * Require the BERT_WEIGHT folder, since we can't write to the MODEL_DIR * Add the PyTorch 3D UNet inference model package for AI Kit 2022.1 (#351) * Add 3D-UNet for PyTorch GPU * Removing wrapper package section for now * doc updates and setup script fix * Add matplotlib install * Quickstart and doc updates * Update to note that weights file will be downloaded by the setup script * Doc and setup.sh script updates to set the BUILD_DIR * Update for setting OUTPUT_DIR instead of PRETRAINED_MODEL dir * Adding paths for the preprocess.py and the make mkdir_postprocessed_data BUILD_DIR * More path updates for run.py * Update pybind dir * Update to loadgen dir * Update to get loadgen 39 wheel * update setup install * README updates * Update setvars.sh * Update to pass build dir * Removing loadgen and nnUnet, since those are now in artifactory * Update setup.sh to move loadgen to a temp directory for install * Removing dockerfile * Updated TF GPU BKCs and docs for NDA release (#346) * Updated NDA batch sizes and added pkg READMEs * Generated docs * Revert spec/doc changes for 3D U-Net and MaskRCNN * Remove pretrained models from inference packages * add rn50 bf16 inference (#352) * PyTorch and TensorFlow fixes for the AI Kit 2022.1 NDA release (#362) * BERT fixes for writing to the model dir * Fix README references to ImageNet in the SSD-ResNet34 docs (should be COCO) * Write BERT training data to OUTPUT_DIR * Adds pip package dependency for TF 3D U-Net Co-authored-by: Melanie H Buehler <[email protected]> * PyTorch 2022.1 GPU NDA container package update (#370) * Updates to the base container for 2022.1 pytorch * Update PyTorch base dockerfiles with distutils (due to error with Python 3.9) * Updates after testing * Fix export to ENV * Updating basekit filenames * Add the PyTorch SSD-ResNet34 Inference container package for the 2022.1 GPU release (#373) * Updates to the base container for 2022.1 pytorch * Update PyTorch 
base dockerfiles with distutils (due to error with Python 3.9) * Updates after testing * Fix export to ENV * Container updates for SSD-ResNet34 inference * Fix to separate installs * Update docs and run.sh with the PRETRAINED_MODEL env var * Container updates for SSD-ResNet34 inference * Fix to separate installs * Update docs and run.sh with the PRETRAINED_MODEL env var * Add the PyTorch 3D UNet container package for the 2022.1 GPU release (#376) * Add dockerfile, docs, and dataset preprocessing script for the container package * Dockerfile update * Fixing missing env var * Add clang install * Fixes for preprocessing * Doc updates and remove need for extra DATASET_DIR for inference since the preprocessed dataset is in the OUTPUT_DIR * Add matplotlib to the dockerfile * updates based on review comment * Update the PyTorch DLRM inference container package to include pretrained weights (#381) * Add back the pretrained model * Fix link * Update the TensorFlow and PyTorch base container documentation to include link to the driver (#378) * Update READMEs to link the driver * Note ATS-P * Updating PyTorch BERT partials for the 2022.1 GPU release (#379) * Fix bert path * Update BERT inference partial * Fixing line * fix bert training installs * tf 2022.1 nda gpu base container cleanup (#384) * 2022.1 NDA base TF GPU container package update * Added pre-trained models back to inference specs * Update for new ITEX and TensorFlow wheels Signed-off-by: Abolfazl Shahbazi <[email protected]> * Download.md cleanup Signed-off-by: Abolfazl Shahbazi <[email protected]> * Regen Dockerfiles Signed-off-by: Abolfazl Shahbazi <[email protected]> * add -p to the mkdir * Fix incorrect 'BaseKit' version name Signed-off-by: Abolfazl Shahbazi <[email protected]> * include wheels and basekit for 3dunet and fix build args Co-authored-by: Melanie H Buehler <[email protected]> Co-authored-by: Dina Suehiro Jones <[email protected]> * take 1 (#392) * take 1 * correct file tree * remove third-party 
filenames * Syncing up the doc fragment with the README update for DLRM inference (#414) * CentOS, Debian, RedHat and SLES support for GPU (#418) * Add support for CentOS 7 and Debian 10, 11 (#391) * Add support for CentOS 7 and Debian 10, 11 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Replace 'dnf' with 'yum' for CentOS 7 compatibility Signed-off-by: Abolfazl Shahbazi <[email protected]> * remove commented line Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add the Yum repo fix for 'CentOS 8' Signed-off-by: Abolfazl Shahbazi <[email protected]> * Making Platform and OS check more portable (#393) * Making Platform and OS check more portable Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a minor syntax error Signed-off-by: Abolfazl Shahbazi <[email protected]> * Adding support for RedHat 7 and 8 (#394) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Finalize Red Hat and CentOS 7, 8 support (#398) * Minor fix for Red Hat support Signed-off-by: Abolfazl Shahbazi <[email protected]> * Improve OS version checking Signed-off-by: Abolfazl Shahbazi <[email protected]> * Introduce devtoolset-7 for CentOS and Red Hat 7 Signed-off-by: Abolfazl Shahbazi <[email protected]> * minor regex fix Signed-off-by: Abolfazl Shahbazi <[email protected]> * yum install consistency Signed-off-by: Abolfazl Shahbazi <[email protected]> * Adding support for SLES 15 (#399) * Adding support for SLES 15.03 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Improve SLES version check regex Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a minor typo in OS name Signed-off-by: Abolfazl Shahbazi <[email protected]> * Improve OS version checking (#401) Signed-off-by: Abolfazl Shahbazi <[email protected]> * PyTorch GPU updates to support both PVC and ATS (#416) * Add ATS-P vs PVC args and conditionals * Doc updates * Updated PVC batch size for BERT large FP32 training * Add env var * Add pci utils to the pytorch base and try out new setvars with 
3dunet * update 3dunet spec setup.sh * ResNet50v1.5 inf update * Revert README changes * Update quickstart scripts and specs * Remove ATS and PVC specific setvars.sh * Remove DEVICE env from run.sh * Remove DEVICE * Remove the 'downloads' for dlrm * Update requirements to mention lscpi and apt/yum * Run accuracy testing first for 3d unet * Doc updates * TF AI Kit 2022.1.1 NDA updates for PVC (#421) * PVC vs. ATS detection for TF model packages * Small update to RN50 BF16 inference BKC * Adds pciutils requirement to documentation * Adds pciutils partial * AI Kit 2022.1.1 NDA remove TF pretrained models (#424) * Remove pretrained models and fix RN50 bs * Fixed BERT Large bf16 training bs * PyTorch container package updates for 2022.1.1 GPU NDA (#427) * PyTorch container package updates for 2022.1.1 GPU NDA * update to basekit 140 * TF container package updates for 2022.1.1 GPU NDA (#428) * TF container package updates for 2022.1.1 GPU NDA * Fix merge conflict * GPU Containers - Mount basekit from host machine (#438) * removed basekit installation * updated tf basekit build script * updated docker file * doc update and minor fixes * pytorch changes * doc updates * GPU workload containers - use basekit from host machine (#439) * tf change to use basekit on host machine * changes for pytorch workload container to use basekit from host machine * Add /opt/intel/oneapi check and volumne mount for the PyTorch 3D UNet dataset preprocessing run script * fixed error * update python path * fix error * removed ats specific envs Co-authored-by: Dina Suehiro Jones <[email protected]> * updated product and agama versions in tool container README; added main README for Container Packages (#447) * GPU Mask RCNN training package (#454) * Initial commit for MaskRCNN training model package * Removed var and regenerate README * Remove arg & update build.sh * Remove build args for basekit and components * Updated specs, partials, dockerfiles * Fixed base tag args and pip install * 
Corrected patch and model files * Fixed dockerfile, quickstart script, and docs * Added requirement and removed unnecessary args * Remove unnecessary files * Add Intel licence headers * bug fix for aizoo-708 (#477) * update README for missing links (#501) * update README for missing links * Update README.md * Update README.md * Update PyTorch GPU model links and removed unused files (#514) * Remove old files * Update list of PyTorch GPU models * Adds a quickstart script for ResNet50 inference with synthetic data for PyTorch GPU (#522) * Adds PyTorch ResNet50 inference GPU script that uses synthetic data * Updated scripts from gpu-models master (75b09b19ed597b4e70fc065a6d68be94406221b3) to get support for dummy data * Update to put import back to * update tools docker file linux base to 20.04 * Add dataset dir for --dummy script * Update PyTorch GPU ResNet50v1.5 synthetic data inference script to allow adjusting the number of iterations run (#545) * add --num-iterations * make num iterations a env var * Update documentation to note number of iterations for synthetic data runs * Updated wheels for the IPEX base container (#692) Co-authored-by: msalopan <[email protected]> * Add ITEX ATS-M whl updates (#696) * made changes for ITEX ATS-M * indentation changes * Update Resnet50v1.5 (#684) * Update Resnet50v1.5 * Adjust format and restore file * ATS-M TF changes (#699) * add benchmark mode for tensorflow ssd * add resnet50 benchmark mode * add rn50 * modify rn50 files * fixing tengfei PR * fixing incorrect folder changes * added licences header * fixed year Co-authored-by: Tengfei, Han <[email protected]> * merge TF base container based on new RC1 whl packages (#700) * ssd-mobilenet tf gpu spec * build…
* correct names of devcatalog files (#1160)
* SPR Ubuntu READMEs modified (#1154)
* Resolve Snyk critical vulnerability (#1162)
* Update FLEX_DEVCATALOG.md (#1165)

---------

Co-authored-by: Srikanth Ramakrishna <[email protected]>
Co-authored-by: Sharvil Shah <[email protected]>
Fix a minor typo in "cloud-data-connector" package name.
* add Max devcatalog and change flex link (#1179) * add Max devcatalog and change flex link * add precision list * remove video groups (#1197) Co-authored-by: Jitendra Patil <[email protected]> * source env vars and remove video grps (#1203) Co-authored-by: Jitendra Patil <[email protected]> * remove video grps pvc (#1213) Co-authored-by: Jitendra Patil <[email protected]> * update driver links,paths and remove dataset coco.names (#1230) * update driver links in devcatalog * update code to include coco files mount --------- Co-authored-by: hanchao <[email protected]> * Container automation (#1292) MZ containers build & test automation using compose & test runner. * add preprocess coco (#1299) * add yolov4 env changes (#1300) * Fix frozen graph for flex series * Added tests for m1 and m3 * Clean up files for release (#1341) * update devcatalog landing page * remove build and multicard devcatalog pyt * remove build and multi-card devcatalogs * restore deleted file * Fixing Lint and Unit tests (#1344) Signed-off-by: Abolfazl Shahbazi <[email protected]> * PVC bert-large inference modification (#1332) * change paths and avoid re-download * Add scanner workflows * Synced with develop (#1348) * Upgrade transformers version to fix CVEs (#1355) * Dataset_librarian code updates (#1187) * Changed location of installation of GCP cli * Adding packaging tests * Interconnection experiment * Test updated workflow * Fixed out_report.xml path * Interconnection for GCP * Adding updates for wheel and setuptools * Experiment requirements * Changes to ignore copied files * Adding fix to handle encoding issues * Adding changes to readme * Taking current folder to get packages * Fix typo Signed-off-by: Felipe Leza Alvarez <[email protected]> * Inter connection sample * Hot fix for python 3.8 * Fix for setup code * ignoring sample file * Renaming correctly interconnection * Fix on upgrade for pip, wheel and setuptools * Interoperability POC sample fix * Fix for .sample copy * Setup for 
bash * Hotfix: change "test" strings for "test_unittest" so that unit tests won't use files used by samples. * Fixing requirements and setup * Removing wheel update from setup * Fixes the problem to upload dataset to GCP. * Fixed unwanted changes on main branch * Fixed aws functional test name * Upload a folder using the aws connector * Last changes on Interoperability are applied * Updated Licence * Removinb code of conduct reference, we have not * Skips row 0 from excel when creating dataframe. * Refactor * delete unused gitignore * change access keys format * Refactored names * first version license header * Remoivng readme files on this branch * Deleting files created from setup * Adding headers into main packages * Adding headers on sample code * Updating files for publishing * Get GCP credentials * Changed setup.sh file * Removed init file on interoperability folder * Big refactoring, moving data_connector into datasets * Complete sample link * Test WF * Fixed path for unit tests * Changed trigger to PR * Create sample link * Merging readmes * Removing licence only for data connector * Missed recursive flag * delete unused names * Ignore outputs of jupyter notebook * Removing commented block * Removing deprecated folder * Include headers in init files * Adding headers in all tests * Removing coverage omit from tox configuration file (This only works on MZ env) * Removed gcp auth commands instead of commenting them * Adding headers * Removed sensitive information on gcp * Solving names to public repo * Removing sample values * Fixing path on script * Removing extra files * add security file * Updating files and structure for publishing * Updates for packaging * Updating readme * Updating metadata * Updating source code * Updating gitgitnore * removing extracted files * Updating repo * removing dataset egg info * Updating file permissions * removing key * Updating imports * Removing error * removing data_connector changes * Updating conda recipes * Fixed bugs 
in setup.sh Added azure src and dependencies. * Removing conda folders * Updating blank space at the end of files * Updating readme * Validation/scans (intel#56) * Fixed dataset_api requirements file * Merging from data_connector * Updating gitignore * Returning depencencies * Returning training code * Creating and re naming sample files * Adding format * New readme proposals * Fix on toml to avoid refactor * Readme agenda * Conda folder is unevitable * Exclud conda and egg folders * Adding badages in main readme... will see if we should use rst format for main readme only * Simple entry point for sample doc * Change header for sub_linked section * Modifications to current lass invocation * Adding relative link to documentation in AWS main readme file * Terms and conditions requirements update * Changes on Azure Readmi file * Removing previous terms and conditions * Updating path for datasets_urls * Updating path for datasets_urls * Removing data connector changes * Updating blank last line * Updated documentation with curren code functionality * Update documentation * Added code sample for upload, download and list blobs for oauth * first definition on dcp readme for bigquery * Sample connection with oauth * Adding readme sample for gcp service account connection with GCP * Connection documentation finished * Updating TPP file * updating with feedback --------- Signed-off-by: gera-aldama <[email protected]> Signed-off-by: Felipe Leza Alvarez <[email protected]> Co-authored-by: Miguel Pineda <[email protected]> Co-authored-by: Gerardo Dominguez <[email protected]> Co-authored-by: gera-aldama <[email protected]> Co-authored-by: Felipe Leza Alvarez <[email protected]> Co-authored-by: aagalleg <[email protected]> Co-authored-by: Leza Alvarez, Felipe <[email protected]> Co-authored-by: ma-pineda <[email protected]> * Feature/aikitpv 828/dataset librarian code refactor (#1273) * Fixing conda recipe syntax error * Updating conda recipe * Data Connector HF for public repo 
(#1331) * Validation/scans (intel#56) * Licence and samples (#1350) --------- Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: gera-aldama <[email protected]> Signed-off-by: Felipe Leza Alvarez <[email protected]> Signed-off-by: Felipe Leza Alvarez <[email protected]> Co-authored-by: Srikanth Ramakrishna <[email protected]> Co-authored-by: Jitendra Patil <[email protected]> Co-authored-by: hanchao <[email protected]> Co-authored-by: mahathis <[email protected]> Co-authored-by: lerealno <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Real Novo, Luis <[email protected]> Co-authored-by: Jesus Herrera Ledon <[email protected]> Co-authored-by: Miguel Pineda <[email protected]> Co-authored-by: Gerardo Dominguez <[email protected]> Co-authored-by: gera-aldama <[email protected]> Co-authored-by: Felipe Leza Alvarez <[email protected]> Co-authored-by: aagalleg <[email protected]> Co-authored-by: Leza Alvarez, Felipe <[email protected]> Co-authored-by: ma-pineda <[email protected]>
* Enabling float16 training for ResNet50v1_5 (#1079) * Enabling float16 training for BERT large / SQuADv1.1 (#1018) * Enable FP16 support for Distilbert Inference (Tensorflow) (#1075) * Fix OOM caused by incorrect thread setting. (#1084) * Ejan/model zoo quickstart (#1082) * Fix syntax for resnet50v1.5 inference * Fix syntax for bert_large accuracy * Import GPU Max and Flex Series workloads from develop-gpu (#1080) * Add GPU DLRM FP16 inference * Change to install ATS drivers from local repo * Add GPU PYT bert large FP16 Inference * fix _FusedMatmlul issue in GPU * Updated PyTorch to use the common compiler partial and added ARG for the env var file since that changes per compiler * Add package for ResNet 50 v1.5 int8 Inference pytorch gpu * Update specs & build files for alpha2 rc1 whls * Add ResNet50 v1.5 bf16 Training PYT GPU * Add wrapper package for TF GPU tool container * Update TF GPU training packages to use alpha2-rc1 * Update IPEX tools container and resnet50v1.5 models for alpha2 rc1 * Update PYT Bert LG and DLRM FP16 inference alpha2-rc1 * Update tf-gpu branch for ww15 dpcpp compiler * Set ITEX_ENABLE_ONEDNN_LAYOUT_OPT=0 for bert training * Add section to validate base container, fix dlrm printed statement * Update the docs for alpha2-rc2 models * fix ipex tool container readme * Fix dlrm print using CPU statement to be XPU * add 1t env vars * Use add instead of addn * Update bert large docs to be specific about which pretrained model to use * Sync with develop Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update the main benchmarks README for gpu models * Set ITEX_ENABLE_ONEDNN_LAYOUT_OPT=0 in ResNet50v1.5 bf16 training quickstart scripts * Revert "tmp fix res50v1_5 int8" This reverts commit 3c120e0bee3a576ee1548d9258b611a889897ee6 * Updates to match batch sizes in docs and updated pb links * Updating compilar binary * Update PYT GPU packages for IPEX alpha2 rc6 * Update GPU specs to make the docs section a list and update TF training docs for 
DevCloud * Doc updates for ResNet50v1.5 and BERT large training for GPU * tf-gpu doc updates * Fix the BKC and environment for resnet50v1.5 INT8, bert-larget and resenet50v1.5 BF16 training * Update GPU PYT packages to have 2 READMEs * Remove duplicate license from package * AI Kit Model Package README * Clean up PYT model pkgs and update baremetal docs * Add ITEX ATS-M whl updates (#696) * made changes for ITEX ATS-M * indentation changes * Update Resnet50v1.5 (#684) * Update Resnet50v1.5 * Adjust format and restore file * ATS-M TF changes (#699) * add benchmark mode for tensorflow ssd * add resnet50 benchmark mode * add rn50 * modify rn50 files * fixing tengfei PR * fixing incorrect folder changes * added licences header * fixed year Co-authored-by: Tengfei, Han <[email protected]> * merge TF base container based on new RC1 whl packages (#700) * ssd-mobilenet tf gpu spec * build based on latest RC1 whl packages * changed horovod version * Add PyTorch SSD-Mobilenet inference for GPU (#685) * Add SSD-Mobilenet * modified some files * modify readme.sh and add link in reference.sh * add dummy data mode * modify some description * modify description about enviroment * Added rc1 update (#702) * Added oneccl whl (#704) * do not use oneccl from basekit (#705) * Add YOLOv4 (#687) * Add YOLOv4 * update README and inference.sh * modify readme and inference.sh * add dummy data mode * test lowecase * test again * modify script and description about dummy, add dummy img * add miss file * updated readme (#706) * Updated RN50 PyTorch Inference spec file (#707) - Updated names in the spec file for RN50 based on scripts in quickstart folder - Updated scripts names in run.sh * correct ssd-mobilenet and yolov4 (#709) * fix bug where only default images would be used * correct scripts * ssd-mobilenet support int8 only * pretrained waight file link is not a direct link, so remove if from script and nee user dowmload it * Modified some descriptions Co-authored-by: Feng Yuan <[email 
protected]> * Added Pytorch RC3 whls (#730) Co-authored-by: Tengfei, Han <[email protected]> * updating TPPs (#728) * 2.8 tpps * remove old files * Added Resnet50_Pytorch for ATS-M (#729) * Added Resnet50_Pytorch for ATS-M * Added documentation and wrapper README * Made changes as per reviews Co-authored-by: Tengfei, Han <[email protected]> * ATS-M support for SSD-Mobilenet and Resnet50V1-5 (#724) * modifying scripts for gpu ssd-mobilenet * changed docker image name * modify changes to test functionality * made changes for obj_det build * changed np version * made version changes * revert changes * add 3.9 dev version and remove 1.17.4 np version to latest * change path of coco py files in models to int8 folder * update new .pb model file * export vars and change/remove DATASET_DIR * made -f to -d change in checkir DIR path * add batch inference for ssd-mobilenet * use dummy data for online and batch inference * add untracked file * change to new models * change warmup and steps * change warmup and steps * add docs for ATS-M ssd-mobilenet * add docs section for ATS-M w/ links * add docs section for ATS-M w/ correct links * modify baremetal.md * modify spec file to add model package * generate model-builder doc * make alignment changes * update GPU name and TF version in README.md and add oneapi dir path var * unify docs of ssd-mbnet and rn50 * make rn50 doc changes and add oneapi as path var * generate model-builder readmes * correct typo * correct typo * add INT8 check,remove other precisions * add ONEAPI_DIR to array * formatting lines * delete baremetal for ATS-M * remove typo and baremetal.md * cleanup and modify readmes * create oneapi_dir for base build * remove hrvd for rc2 test * remove hvd from base build and ITEX BKC env * Delete -tf-gpu-ssd-mobilenet-inference.temp.Dockerfile * initial review changes * add prvileged mode for cpu freq scaling * remove aikit.md * correct readme typos * correct comments * minor readme changes * check dataset path only for 
accuracy * check dataset_dir only for accuracy * add aikit back * add gpu name and refine base readme * change docker.md on dummy data * add aikit for both models * add aikit for both resnet * add privileged mode Co-authored-by: Ramakrishna, Srikanth <[email protected]> Co-authored-by: Mahathi Vatsal <[email protected]> * Update readme (#733) * updated readmes * updated readmes again * Added ssd-mobilenet pytorch for ATS-M (#734) * Added ssd-mobilenet pytorch for ATS-M * Made changes as per reviews * Added YOLOv4 for ATS-M (#735) * Added YOLOv4 for ATS-M * Made changes as per reviews * Made changes in model.py to run yolov4. - Modified build.sh for ipex-tool-container. - Modified run.sh in yolov4 to mount PRETRAINED_MODELS * update docs * Removed HVD and torch ccl whls (#741) * Removed HVD and torch ccl whls * Removed sythentic_data scripts ffrom rn50 spec file * Removed scripts from run.sh * Update rn50, ssd-mobilenet and yolo (#748) * Update rn50,yolo and ssd-mobile * delete emulation * update model Co-authored-by: chaohan <[email protected]> * Mahathi/ipex mkl update (#753) * Added mkl/compiler packages * Added tbb in spec file * Removed oneapi path in build.sh * Modified old files Co-authored-by: Srikanth Ramakrishna <[email protected]> * dpcpp,mkl,tbb inside container ATS-M (#756) * test dpcpp,mkl in base * make partial changes * add tbb files to partial * fix typo in ttb addition * remove two export vars * remove oneapi dir check and mount * add end of line * re-add end of line Co-authored-by: Mahathi <[email protected]> * Removed oneapi from run.sh in workloads (#758) Co-authored-by: Srikanth Ramakrishna <[email protected]> * doc-level changes for ATS-M TF base and WL containers (#754) * test dpcpp,mkl in base * make partial changes * add tbb files to partial * fix typo in ttb addition * remove two export vars * remove oneapi dir check and mount * change name of gpu * change gpu name * add driver download link and remove custom paths * provide driver 
download link * refine typos in wl and base docs * remove onapi volume mount * remove model req and path for ITEX Co-authored-by: Mahathi <[email protected]> * Modified all README's (#757) * Modified all README's * Modified README's * update readmes Co-authored-by: Srikanth Ramakrishna <[email protected]> * Fixed typo in IPEX dockerfile (#760) * Fix styler and unit tests for develop-gpu (#777) * Fix styler and unit tests for develop-gpu Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix unittests too Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: Abolfazl Shahbazi <[email protected]> * Sync with develop branch (#774) * Update args.rank and args.world_size for maskrcnn (#338) * Pytorch updates for SPR 2022 ww01 and resolve AIDEVOPS-703 (#330) * Updates to resolve AIDEVOPS-703 * Removing empty .dockerignore * Removing extra line * Update TF inference language modeling (BERT Large) docs for instructions to run on Windows (#342) * update tf inference language modeling for windows instructions * modify the BS of maskrcnn throughput (#356) * Fix quick start scripts links in object detection docs (#358) * Enable running models on certain num of cores (#343) * Enable running on certain num of cores * Removed hard-coded number * Checking if HT is on/off * Fixed tests and platform util for perf notebook (#361) * Update dataset to 3 RNN-T training datasets (#357) * Update dataset to 3 RNN-T training datasets In this commit, train-clean-360 and train-other-500 are added in model. These datasets need 500GB disk space to preprocess. It will take ~4 hours to run the entire 3 datasets for one epoch in BF16. You can terminate the training process by adding `num_steps` in models/language_modeling/pytorch/rnnt/training/cpu/train.sh. 
* Set NUM_STEPS outside of bash script * Add note that FP32 runs 100 steps * workaround to fix distributed training issue (#365) * update the BS of maskrcnn throughput (#366) * Fix maskrcnn output scirpt for ipex distributed training (#360) * Update 3D UNet MLPerf doc to run FP32 inference on windows (#367) * update 3dunet mlperf doc to run fp32 inf on windows * Fix doc links for the Windows supported models list (#368) * update links * Transformer ML-Perf SPR WW04 (#359) * Changed the attention part so that it can utilize the existing fusion of batchmatmul+mul+addv2, and also use static varibles to reduce redundant compution * fixed a minor bug for a static variable * Changed the model so that the reshape can be moved out of dense layer so that we can fuse the ops in the dense layers * Changed the depth of attention to a static variable * fix bert pre train distributed bug (#369) * Weizhuoz/fix bert ddp (#374) * tee Bert ddp to a specific log file * Add tee on phase1 * Fix maskrcnn distributed training calculation * Enable jemalloc for BERT throughput mode (#375) * update bs and use ipex Lamb (#382) * fix distribute training for DLRM and use launcher (#383) * Add a separate doc for windows env setup (#371) * add a separate doc for windows support on baremetal * use msys bash to run start.sh for windows * update supported models docs for model dependencies on Windows * fix distribute training for DLRM and use launcher (#386) * Update ImageNet Dataset preprocessing instructions (#385) * update imagenet dataset preprocessing scripts and doc * Ttitswor/snyk cli support (#340) * tables version out of date whl would not build properly on sf-client. * Updating intel-tensorflow version does not exist * Updating tensorflow-addons Version does not exist. * Updating horovod whl no longer builds successfully on Python 3.9+ * remove empty requirements.txt file sf-client will fail, no need for empty req file. 
* Updating Pandas version out date, whl no longer builds successfully on Python 3.9+ * Update pandas Version not longer builds whl successfully on python 3.9+. * Update numpy Version whl fails to build successfully on python 3.9+ * Updating horovod Version fails to build whl successfully on Python 3.9+. * Update SimpleITK Version does not install correctly on python version 3.9+. * Updating numpy numpy==1.16.3 does not build whl successfully on Python 3.9+. * Updating scipy scipy==1.2.0 fails to build whl successfully on Python 3.9+ * Updating h5py h5py==2.10.0 fails to build whl successfully on Python 3.9+. * Updating numpy numpy>=1.16.3 fails to build whl successfully on Python 3.9+. * Update h5py h5py==2.10.0 fails to build whl successfully on Python 3.9+. * Remove upload to GCS (#387) * Remove upload to GCS Signed-off-by: Abolfazl Shahbazi <[email protected]> * remove gcs option from the shell script Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add support for CentOS 7 and Debian 10, 11 (#391) * Add support for CentOS 7 and Debian 10, 11 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Replace 'dnf' with 'yum' for CentOS 7 compatibility Signed-off-by: Abolfazl Shahbazi <[email protected]> * remove commented line Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add the Yum repo fix for 'CentOS 8' Signed-off-by: Abolfazl Shahbazi <[email protected]> * Adding support for RedHat 7 and 8 (#394) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update COCO validation dataset instructions for bare metal and docker (#390) * update coco dataset instructions for baremetal and docker * update coco script and instructions to remove output dir env var * Add numactl partial to wide and deep (#396) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Making Platform and OS check more portable (#393) * Making Platform and OS check more portable Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a minor syntax error Signed-off-by: Abolfazl 
Shahbazi <[email protected]> * Improve OS version checking (#401) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Minor syntax updates for py38 or newer (#400) * Minor syntax updates for py38 or newer Signed-off-by: Abolfazl Shahbazi <[email protected]> * More Python3.8 compliant literal comparison fixes Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update training.sh (#403) change "socked_id" to "node_id" for ipex launcher * Fix tcmalloc path to set LD_PRELOAD (#388) * Fix tcmalloc.so path * Formatting * Removing debug messages * Unit test update * Test updates * Add tcmalloc to the int8 dockerfiles * Removing files we don't need * Finalize Red Hat and CentOS 7, 8 support (#398) * Minor fix for Red Hat support Signed-off-by: Abolfazl Shahbazi <[email protected]> * Improve OS version checking Signed-off-by: Abolfazl Shahbazi <[email protected]> * Introduce devtoolset-7 for CentOS and Red Hat 7 Signed-off-by: Abolfazl Shahbazi <[email protected]> * minor regex fix Signed-off-by: Abolfazl Shahbazi <[email protected]> * yum install consistency Signed-off-by: Abolfazl Shahbazi <[email protected]> * Stock TensorFlow v2.5/v2.6/v2.7 support for performance analysis notebook -(sync with develop branch Jan 26) (#377) * add back some missing patches * add TF_ENABLE_ONEDNN_OPTS support for stock TF 2.5 and above * transformer patch fix * Update README.md * online mode support * Adding support for SLES 15 (#399) * Adding support for SLES 15.03 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Improve SLES version check regex Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix a minor typo in OS name Signed-off-by: Abolfazl Shahbazi <[email protected]> * Fix BERT data instructions (#402) * add bert data instructions in a separate doc * update bert large dataset instructions * Weizhuoz/fix ipex ww05 (#404) * fix DLRM throughput output error * Modify socket_id to node_id for ipex launcher * fix data preprocessing script link for bert base and bert 
LT(#407) * Add kmp_blocktime arg for ResNet101 int8 (#410) * [RNN-T training] Update download_dataset.sh (#412) Align with MLPerf: Remove --speed 0.9 1.1 * Add a snippet to download COCO2014 dataset files (#411) * Fix failing unit tests (test_bare_metal and bert_fp32_inference) (#409) * Fix unit tests * benchmarks/ * Rename var so that it's not confused with the actual number of platform cores * Add socket id 0 test * Fix the link for the income census dataset download script (#413) * BERT: Enable weight sharing and remove data layer for benchmarking (#406) * Fix unit and style tests for BERT (#415) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add Jupyter notebooks for fine tuning BERT from TF Hub (#408) * Add WIP notebooks * Add question and answering notebook * Update classifier to clean up and document and add a second dataset * updated notebook and model map with more BERT models * Add README and update the ipynb name * Remove unused notebook * Update to remove the section that displays data with the predictions * Add utils file * utils comments and README update * Updated files * Clean up displaying predictions to use a pandas df * Updates after notebook clean up and add export to the q&a notebook * Retested and updates * README updates and comments/formatting in utils scripts * Add note about expecting that tensorflow has already been installed * Add notebooks to the main TL README * Add missing new lines * Add pip install ipywidgets==7.6.5 after testing on bare metal * Rename BERT Question Answering notebook * Notebook updates based on review feedback * Remove inadvertant changes * Removing empty line * PYT transfer learning notebook for object detection (#397) * Initial commit of notebook and utils * Added a README * Removed non-functioning datasets & models * Doc edit * Fixed bugs, improved explanations, suppressed warnings * Adds notebook for generic image classification (#364) * Adds image classification notebook for user datasets * Adds Image 
Classification transfer learning notebook * Fixed links and text * Minor doc updates * Updated for review feedback * Moved training-specific vars to TL section * Newline and license header * fix python seed (#417) * Fix DIEN no requirements.txt file found (#422) * bug fix in ssd-resnet34 (#423) * update the BS of maskrcnn throughput (#425) * Add a doc for transformer language mlperf dataset (#419) * Add a doc for wide and deep large dataset instructions (#420) * add a doc for inference dataset instructions, and updating the models docs * Doc updates for the Transfer Learning notebooks (#430) * Add the TF models dataset links to the main models table (#429) * Fix dlrm without ipex-interaction (#434) * Fix link for PyTorch RoBERTa base inference (#436) * Enable inference for PyTorch TransNetV2 (#426) * Enable inference for PyTorch TransNetV2 * enable bf16 inference for PyTorch TransNetV2 * update README * use dummy data * Add the option to use a custom dataset in the BERT binary text classification notebook using TF Hub (#435) * Add the option to use a custom dataset in the BERT binary text classification notebook using TF Hub * update bert_utils to add the download_and_extract_zip function * Updates based on review feedback * add WER for RNN-T (#440) * Update recommendation inference docs for Windows instructions (#437) * add windows instructions for dien and wide&deep inference * fix accuracy issue in 4.10 transformers in patches (#441) Co-authored-by: Jiayi Sun <[email protected]> * update pytorch maskrcnn for PT change (#442) * use multi-instances(one node for each instance) for throughput run (#443) * A new Jupyter notebook for lpot quantization tutorial and related perf analysis (#115) * draft for lpot quantization and perf analysis jupyter notebook * Update Louie/lpot perf analysis by review comments (#298) * update with formal name of model zoo, correct wrong words, add license in python file * rm empty line Co-authored-by: Neo Zhang Jianyu <[email 
protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * use multi-instances for maskrcnn training (#445) * Update language translation docs for windows support (#444) * update bert and transfromer lt official docs for windows support * fix a wrong link for 3dunet readme * Update run_bert_pretrain_phase2.sh (#449) * Update run_bert_pretrain_phase1.sh (#450) * enable resnet50 training for multi sockets (#448) * Update DLRM training to train on 2S. (#451) * Launcher command shell in Windows to achieve better AI workload performance for certain Intel client hardware (#395) *Launcher command shell in Windows to achieve better AI workload performance for certain Intel client hardware * update the list of supported models on windows (#455) * Update BraTS2018 data preprocessing instructions for 3D-UNet (#452) * Fix for keras experimental for bert. (#433) * Add a PyTorch NLP fine tuning notebook using the IMDb dataset for sentiment analysis (#453) * Add the pytorch IMDB fine tuning notebook * Update markdown * Add README * Renaming notebook and main doc update * Fix link * fix path in readme * Update requirements.txt * Add datasets to requirements * Add transformers to requirements * add sklearn to requirements * Updates based on review feedback - fixing 'extends pytorch * Update the README to specify 3.9 * Use 'NeoZhangJianyu' ID from GitHub (#456) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Leslie/add runtime extension support (#457) * add runtime extension for ssd-rn34 accuracy inference * support iteration larger than dataloader * change the weight sharing script name (#461) * fine tune for dataset env var configuration (#463) * Add rn50 inference runtime extension support for throughput/accuracy (#462) * add rn50 throughput mode runtime extension support * add rn50 accuracy mode runtime extension support * Update the PyTorch Text Classification fine tuning notebook to allow using a custom dataset (#467) * Update the PyTorch Text Classification 
fine tuning notebook to use a custom dataset * update description at the top of the notebook to mention the custom dataset option * Add citation for the SMS text collection dataset * Update the PyTorch text classification README to note the custom dataset option * Rename the notebook and update the main TL ReadMe * Clearing notebook output * Fix syntax * Fix Transformer Language mlperf to add arg --kmp-blocktime (#469) * fix transformer mlperf to parse --kmp-blocktime, in case set on the system * Windows support for Transformer Language MLPerf inference (#471) * fix python format and update docs for instructions * Minor clean up (#459) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Updated the transformer_mlperf inference profiling option, and some minor changes in the README (#472) * Modify the output tag for IPEX DDP (#475) * remove manual conversion of models to datatype (#478) * feed sample input while prepacking for training (#479) Co-authored-by: Wang, Chuanqi <[email protected]> * Minor flake8 fix (#481) Signed-off-by: Abolfazl Shahbazi <[email protected]> * update the Pytorch URL for develop branch (#485) * Update versions and URLs for release v2.7 (#484) * Update versions and URLs for release v2.7 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Regenerate docs and dockerfiles Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update the main IMZ README.md to list models per use case (#466) * add usecases tables in the main model readme and benchmarks readme * revert bf16 changes (#488) * Add partials and spec yml for the end2end DLSA pipeline (#460) * Add partials and specs for the end2end DLSA pipeline * Add missing end line * Update name to include ipex * update specs to have use the public image as a base on one and SPR for the other * Dockerfile updates for the updated DLSA repo * Update pip install list * Rename to public * Removing partials that aren't used anymore * Fixes for 'kmp-blocktime' env var (#493) * Fixes for 
'kmp-blocktime' env var Signed-off-by: Abolfazl Shahbazi <[email protected]> * update per review feedback Signed-off-by: Abolfazl Shahbazi <[email protected]> * Add 'kmp-blocktime' for mlperf-gnmt (#494) * Add 'kmp-blocktime' for mlperf-gnmt Signed-off-by: Abolfazl Shahbazi <[email protected]> * Remove duplicate parameter definition Signed-off-by: Abolfazl Shahbazi <[email protected]> * add sample_input for resnet50 training (#495) * remove the case when fragment_size not equal args.batch_size (#500) * Changed the transformer_mlperf fp32 model so that we can fuse the ops… (#389) * Changed the transformer_mlperf fp32 model so that we can fuse the ops in the model, and also minor changes for python3 * Changed the transformer_mlperf int8 model so that we can fuse the ops in the model, and also minor changes for python3 * SPR updates for WW12, 2022 (#492) * SPR updates for WW12, 2022 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update for PyTorch SPR WW2022-12 Signed-off-by: Abolfazl Shahbazi <[email protected]> * Update pytorch base for SPR too Signed-off-by: Abolfazl Shahbazi <[email protected]> * Stick with specific 'keras-nightly' version Signed-off-by: Abolfazl Shahbazi <[email protected]> * Updates per code review Signed-off-by: Abolfazl Shahbazi <[email protected]> * update maskrcnn training_multinode.sh (#502) * Fixed a bug in the transformer_mlperf model threads setting (#482) * Fixed a bug in the transformer_mlperf model threads setting * Fix failing tests Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * Added the default threads setting for transformer_mlperf inference in… (#504) * Added the default threads setting for transformer_mlperf inference in case there is no command line input * Fix unit tests Signed-off-by: Abolfazl Shahbazi <[email protected]> Co-authored-by: Abolfazl Shahbazi <[email protected]> * PyTorch Image Classification TL notebook (#490) * Adds new TL notebook with 
documentation * Added newline * Added to main TL README * Small fixes * Updated for review feedback * Added more models and a download limit arg * Removed py3.9 requirement and changed default model * Adds Kitti torchvision dataset to TL notebook (#512) * Adds Kitti torchvision dataset to TL notebook * Fixed citations formatting * update maskrcnn model (#515) * minor update. (#465) * Create unit-test github action workflow (#518) * Create unit-test github action workflow Tested here: https://github.com/sriester/frameworks.ai.models.intel-models/runs/6089350443?check_suite_focus=true Runs tox py.test on push. * Containerize job * Update unit-test.yml Changed docker credentials to imzbot * Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519) Signed-off-by: Abolfazl Shahbazi <[email protected]> * update distilbert model to 4.18 transformers and enable int8 path (#521) * rnnt: use launcher to set output file path and name (#524) * Update BareMetalSetup.md (#526) Always use the latest torchvision * Reduce memory usage for dlrm acc test (#527) * updatedistilbert with text_classification (#529) * add patch for distilbert (#530) * Update the model-builder dockerfile to use ubuntu 20.04 (#532) * Add script for coco training dataset processing (#525) * and update tensorflow ssd-resnet34 training dataset instructions * update patch (#533) Co-authored-by: Wang, Chuanqi <[email protected]> * [RNN-T training] Enable FP32 gemm using oneDNN (#531) * Update the Readme guide for distilbert (#534) * Update the Readme guide for distilbert * Fix accuracy grep bug, and grep accuracy for distilbert Co-authored-by: Weizhuo Zhang <[email protected]> * Update end2end public dockerfile to look for IPEX in the conda directory (#535) * Notebook to script conversion example (#516) * Add notebook script conversion example * Fixed doc * Replaces custom preprocessor with built-in one * Changed tag to remove_for_custom_dataset … * Change --num-inter-threads to 1 for 
bert-large int8 (#1091) * modify ResNet50 training (#1092) * Cherry-pick commits for GPU Flex updates from the develop-gpu branch (#1090) * Corrected typos in README (#1074) * IPEX FLEX 555 docker validation (#1060) * update IPEX flex series for new driver version * add dummy,batchsize options * clarify readme details,change docker image name * update download links * add bs and num_iter env vars * ITEX FLEX 555 docker validation (#1059) * update flex workloads for the new driver version * add batch size as env * Modified Readme for baremetal for ITEX workloads --------- Co-authored-by: Mahathi Vatsal <[email protected]> * clean old devcatalog instructions (#1086) * change precision * remove old instructions --------- Co-authored-by: Srikanth Ramakrishna <[email protected]> Co-authored-by: Mahathi Vatsal <[email protected]> * fix bert large fp32 training for cpu not to use keras_policy datatype (#1093) * add try except to avoid mkdir fail the case (#1095) Co-authored-by: xiaoman-liu <[email protected]> * develop branch: fix return_dict config (#1104) * remove extra dockerfile (#1108) * fix data buffer for DLRM while epoch > 1 (#1109) Co-authored-by: chunyuan-w <[email protected]> * exit while epoch reach args.nepoches (#1110) Co-authored-by: chunyuan-w <[email protected]> * revert changes to fix perf drop (#1101) * fix buffer_num==0 issue (#1114) Co-authored-by: chunyuan-w <[email protected]> * Update readme for dgpu workloads and IPEX cpu version (#1116) * update main readme for dgpu workloads * upgrade ipex and torch versions for cpu * pick files for multi-card release ipex (#1119) * pick files for multi-card release ipex * pick files for release ITEX multi-card (#1111) * pick files for release * Update flex_multi_card_batch_inference.sh * Update flex_multi_card_online_inference.sh * remove edits from unrelated files,rename file * TensorFlow Linux CI Temporary Change of 3d_unet_mlperf Model (#1121) * Update requirements.txt * fixed license issue in dataset api 
(#1126) * rnnt: fix _joint_step_batch for stock pt path (#1123) * rnnt: fix stock pt path and refactor code (#1124) * rnnt: refactor ipex fp32 & bf16 * refactor with or without ipex path; convert embed dtype for stock pt path; * port changes from validation for more scenario support * simplify embed dtype conversion and jit * remove torch.compile for now since seems incorrect * fix graph_mode * remove redundant space * Update requirements.txt for rfcn model (#1128) * Decouple TF ResNet50v1.5 GPU/CPU model scripts (#1133) * decouple training scripts for cpu and gpu * decouple inference scripts * update unittests * fix pythonpath * fixed vulnerabilities for snyk scan (#1134) * fixed vulnerabilities for snyk scan * Enable Vision transformers inference on CPU (#1102) * enable hf vit model * enable vit inference * Update README.md * fix patch (#1137) * Update requirements.txt (#1135) * fix dlrm ddp training local variable Batch referenced before assignment" (#1139) * Fix some quickstart scripts for optional ARGS (#1138) * fix some quickstart scripts for image recognition * Updated with latest ITEX and TF version (#1131) * Updated readmes to refer to the latest ITEX instructions * Add support for MVTEC-AD dataset in dataset API (#1099) * add mvtec dataset download support * add preprocessing support * update for not to remove the raw data file after extraction * use wget to reduce download time * display the wget logs * update requirements.txt * add one more data file for dureader * update broken links in devcatalog (#1140) * update broken links and update filename * Updated with latest IPEX and torch (#1130) * Added IPEX latest documentation * Corrected batch size for bfloat16 (#1098) * Corrected batch size for bfloat16 * Ejan/test quickstart (#1145) * Update latency calculation in quick start script * Fix distilbert script * Add fix for 3dunet quickstart --------- Co-authored-by: shahbaazsyed <[email protected]> * Ejan/test mobilenet v1 (#1150) * Take out bfloat16 env 
settings * Expect correct vision images path (#1143) * Expect correct vision images path * remove wget * add wget to setup.sh * add sudo * Revert "add sudo" This reverts commit 77ef66abac66e203b423948f76322b153aa63743. * update preprocessing scripts for brca --------- Co-authored-by: WafaaT <[email protected]> * Enabling MobileNetv2 (#1088) * Model Enabling for MobileNetv2 * Updated unit tests * release clean up for PVC containers (#1146) * release clean up for PVC containers * revert license dates * Ejan/dien quickstart (#1153) * Fix dien quickstart * Data Connector integration to Model Zoo (#1136) * Add data connector Co-authored-by: Felipe Leza Alvarez <[email protected]> Co-authored-by: Leza Alvarez, Felipe <[email protected]> Co-authored-by: Miguel Pineda <[email protected]> Co-authored-by: Gerardo Dominguez <[email protected]> Co-authored-by: aagalleg <[email protected]> * Revert changes in ssd-mobilenet int8 cpu scripts (#1155) * revert changes in ssd-mobilenet int8 cpu scripts * update unittests * SPR Ubuntu READMEs modified (#1154) * correct names of devcatalog files (#1160) * reverting the changes for BERT fp16 inference with keras MP (#1141) * Resolve Snyk critical vulnerability (#1162) * Updated branch for SDLe scans (#1148) * Changed declaration of output * Update FLEX_DEVCATALOG.md (#1165) * fix for 'hashlib.md5' bandit scans (#1157) Signed-off-by: Abolfazl Shahbazi <[email protected]> * Enable GPT-J/bloom inference for fp32/bf16/bf32/int8/calibration (#1168) * Enable GPT-J/bloom inference for int8/fp32/bf16/bf32 * Refine bloom-176b inference (#1170) * Enable Stable Diffusion inference for fp32/bf16/fp16/int8/calibration (#1172) * add stable diffusion * modify scripts * modify inference_realtime.sh * enable int8 * modify scripts and README * add calibration script * Numpy parameter constraint (#1171) Co-authored-by: dhermosi <[email protected]> Co-authored-by: Wafaa Taie <[email protected]> * Fix in DIEN model,for changes in Tensorflow framework 
(#1125) * Fix for array_ops change * Remove some check for latest tf * Change seq test to list test * Enabling MMoE training with bfloat16 and fp16 precisions (#1149) * enabling MMoE training with bfloat16 and fp16 precisions * fixing coverage tests * changing arg model-dir to output-dir * modifying tf_mmoe_args.json * correcting --output-dir type on model_init * coding style fixes * adding coverage tests for bfloat16 and fp16 training * Changes to fix horovod issue (#1163) * add license for stable-diffusion scripts (#1173) * Dlrm v2 (#1151) * copy dlrm v2 from mlperf repo * remove gpu/distribute/mlperf-log/fused-optimizer related code * enable ipex.optimize with fp32/bf16/fp16, INT8 blocked by trace issue * add data_process folder * add log for performance and add script to enable torchrec dlrm inference/training fp32/fp16/bf16 * enable int8 * enable int8 * add intel license * add Max devcatalog and change flex link (#1179) * add Max devcatalog and change flex link * add precision list * Update CODEOWNERS * Update requirements.txt (#1178) Co-authored-by: justkw <[email protected]> * Add scripts to create model zoo bits for aikit (#1177) * add scripts to create model zoo bits for aikit * changes for code review * Liangan1/remove configure file (#1188) * Add README.md * Remove configure.json * modify maskrcnn training script (#1184) * Liangan1/update bloom model (#1193) * Add README.md * Remove configure.json * Update model to bloom-1b-4-zh * Revert "Remove configure.json" This reverts commit a17d0ae219acf2a0bc6d8e0ef73cbadec35a2522. 
* Update READEME * Update code owners list (#1186) * only remove torch hub when rn101 (#1200) * only remove torch hub when rn101 * rm -resnext_wsl_model_names due to duplication with hub_model_names * download weight only with pretrained==true --------- Co-authored-by: XiaobingZhang <[email protected]> * Maskrcnn solver steps (#1189) * split solver steps to avoid use () in command line * rm inf change due to no SOVLER.STEPS in inf * split solver steps to avoid use () in command line * rm inf change due to no SOVLER.STEPS in inf * Use gcc for horovod installation (#1202) * test workaround * try flag * try gcc * Enable weight sharing for INT8 and BF16 distilbert (#1097) * Enable weight sharing for INT8 distilbert * enable bf32 for resnet50 training (#1210) * remove video groups (#1197) Co-authored-by: Jitendra Patil <[email protected]> * source env vars and remove video grps (#1203) Co-authored-by: Jitendra Patil <[email protected]> * remove video grps pvc (#1213) Co-authored-by: Jitendra Patil <[email protected]> * add environment variables GLOBAL_BATCH_SIZE and LOCAL_BATCH_SIZE for resnet50 and maskrcnn distributed training (#1212) Co-authored-by: liangan1 <[email protected]> * Update run_multi_instance_throughput.sh (#1211) * add workflow to automate mz drop to aikit (#1217) * Gda/scans (#1194) * Updated scans * Test scans on github.head_ref branch Fixed merge conflicts Fixed unit test requirements * update model zoo drop workflow (#1218) * update to use runners and run in container * update mz drop workflow (#1220) * changes to clone oneapi tools repo (#1222) * fix create bits script (#1223) * Fixed version for snyk scan (#1191) * Fixed version for snyk scan * test mz drop worflow (#1224) * fix create bits script * update create mz aikit bits script (#1226) * comment out drop to artifactory code (#1227) * update pytorch gpu yolov4 scripts (#1225) * update pytorch yolov4 scripts * update readme --------- Co-authored-by: Srikanth Ramakrishna <[email protected]> * 
Gda/checkmarx (#1221) * Test checkmarx scan * Test PR tests * Test Snyk and Checkmarx scans * Test bandit scan * update driver links,paths and remove dataset coco.names (#1230) * update driver links in devcatalog * update code to include coco files mount --------- Co-authored-by: hanchao <[email protected]> * Create CI/CD pipeline orchestrator workflow (#1229) * Create CI/CD pipeline orchestrator workflow * Rename top layer workflow job names * Update preprocessing scripts for BRCA dataset (#1231) * update scripts for brca * Cicd orchestrator (#1236) * Create CI/CD pipeline orchestrator workflow * Rename top layer workflow job names * Add scheduled execution for CI/CD pipeline * Fix jobs uses paths * Cicd orchestrator (#1237) * Create CI/CD pipeline orchestrator workflow * Rename top layer workflow job names * Add scheduled execution for CI/CD pipeline * Fix jobs uses paths * Add starting job to gather all following jobs in a common root * Send fixed paths to remote * Cicd orchestrator (#1238) * Create CI/CD pipeline orchestrator workflow * Rename top layer workflow job names * Add scheduled execution for CI/CD pipeline * Fix jobs uses paths * Add starting job to gather all following jobs in a common root * Send fixed paths to remote * Change workflow file extension * modify stable_diffusion accuracy script (#1235) * Provide explicit inputs to workflow call (#1242) * enable int8-bf16 mixed datatype for stable diffusion (#1241) * use --memory-allocator jemalloc replace default_allocator (#1247) * Enable int8-fp32 for BertLarge (#1120) * Update script for the HF model (#1251) * Update script for the HF model * update model link * Fixing batch size param in transformer training (#1244) * Dataset_librarian code updates (#1187) --------- Signed-off-by: gera-aldama <[email protected]> Signed-off-by: Felipe Leza Alvarez <[email protected]> Co-authored-by: Miguel Pineda <[email protected]> Co-authored-by: Gerardo Dominguez <[email protected]> Co-authored-by: gera-aldama 
<[email protected]> Co-authored-by: Felipe Leza Alvarez <[email protected]> Co-authored-by: aagalleg <[email protected]> Co-authored-by: Leza Alvarez, Felipe <[email protected]> Co-authored-by: ma-pineda <[email protected]> * Dien fix for training failure (#1252) * Fixing SSD-RN34 Training accuracy/convergence with NHWC format (#1249) * Fixing SSD-RN34 accuracy with NHWC format * Fixing RFCN inference model script to use the right numactl parameters (#1253) * Fixing RFCN model script * Fixing unit test * Add inputs to manual CI/CD workflow execution (#1254) * Add inputs to manual CI/CD workflow execution * Add schedule execution commet * Change default value for is_lkg_drop flag to true * enable fp16 for resnet50 training (#1258) * Enable GNN models in Model Zoo based on PyG (#1106) * Enable graph classification of inference in model_zoo * Enable training * Add quickstart for inference and training * Rebase with PyG master --------- Co-authored-by: jiayisunx <[email protected]> * Disable failing scanners * Remove dependency on commented out scanners * Fix precision value for TF workload test execution * Enable MZ drops in CI/CD pipeline * Add default value for test step * Fixing conda recipe syntax error (#1261) * Feature/aikitpv 828/dataset librarian code refactor (#1273) * Fixing conda recipe syntax error * Updating conda recipe * Maskrcnn print (#1262) * add printing maskrcnn model info into debug level * comment print info * Weizhuoz/fix numactl emr (#1263) * Fix RN50 and RN101 THP, use launcher --throughput-mode * distilbert use all numa, and reduce latency steps to 250 * fix latency numactl for SNC=2 EMR * Adding llm models inference generation for llama/gptj/bloom and lora training for llama (#1259) * draft adding llama training and inference * llm model common enabling * Update README.md * model refine * Update finetune.py * Update prompter.py --------- Co-authored-by: liangan1 <[email protected]> Co-authored-by: leslie-fang-intel <[email protected]> * 
output fps info for DLRM-v2 (#1277) * Add temporary Docker Hub personal credentials * Move Docker Hub credentials to the image section * Feature/aikitpv 828/dataset librarian code refactor (#1280) * Fixing conda recipe syntax error * Updating conda recipe * Updating python requirements version * Updating python requirements version * Updatin python version from requirements * Updatin pypi package SHA --------- Co-authored-by: Wafaa Taie <[email protected]> * change torchccl_version (#1287) * add preprocess coco (#1299) * add yolov4 env changes (#1300) * Added split ratio as sub arg in brca preprocess (#1284) * Added split ratio as sub arg in brca preprocess * Modified dataset_api readme * modify rn50 training script (#1310) * modify maskrcnn training script (#1312) * Revert "buried Jenkinsfile test (#1309)" (#1317) This reverts commit 3a5583fa78a348a6bb031a781da2936117874065. * Fix C++17 build issue of RNN-T training (#1314) * Fix inceptionv3 latency regression (#1307) * Fix frozen graph for flex series * enable accuracy test for dlrm-v2 (#1323) * Unify the patch for Transformers model and upgrade Transformers to 4.28.1 (#1302) * upgrade transofmers to 4.28.1 and unify all of the transformers models' patch * pr refine * refine patch * fix patch * fix int8 acc * Dlrmv2 (#1328) * fix auc compute for log * enable jit/prof * change inference batch size from 16 to 32K * enable bf32 (#1329) * update DLRM int8 config with correct calibration set (#1330) * Update mz-workload-tests.yml Updated internal container image to use for workload tests * Data Connector HF for public repo (#1331) --------- Signed-off-by: Felipe Leza Alvarez <[email protected]> Co-authored-by: Felipe Leza Alvarez <[email protected]> Co-authored-by: aagalleg <[email protected]> Co-authored-by: Gerardo Dominguez <[email protected]> Co-authored-by: Leza Alvarez, Felipe <[email protected]> Co-authored-by: Miguel Pineda <[email protected]> Co-authored-by: ma-pineda <[email protected]> Co-authored-by: 
gera-aldama <[email protected]> * tests for m1 and m3 (#1301) * add pytorch m1 and m3 in one container * streamline validation pre-process * add new scripts for yolov4 * combine fps for m3 * review devcatalogs Co-authored-by: Jitendra Patil <[email protected]> * Clean up files for release (#1341) * update devcatalog landing page * remove build and multicard devcatalog pyt * remove build and multi-card devcatalogs * restore deleted file * Checkmarx fix for not updated input names (#1343) Fixed wrong names in workflow call input names * Fixing Lint and Unit tests (#1344) Signed-off-by: Abolfazl Shahbazi <[email protected]> * PVC bert-large inference modification (#1332) * change paths and avoid re-download * fix for accuracy (#1345) Co-authored-by: mahathis <[email protected]> * Update Scanner_Snyk.yml with new name in one-ci-cd repo * Update Scanner_Snyk.yml with refs input instead of ref * remove deprecated API replace_lstm_with_ipex_lstm (#1347) * add Throughput keyword (#1295) * add Throughput keyword * use tqdm.format_dict to record thp info * disable pbar before print thp, to avoid final wrong output * add enter before print thp --------- Co-authored-by: jiayisunx <[email protected]> * Update Scans.yml with latest Snyk scanner golden workflow changes * Update mz-workload-tests.yml with git dependency install (#1351) * modify maskrcnn script (#1352) * Upgrade transformers version to fix CVEs (#1355) * Use DockerHub account on workfload tests Moved back to Python 3.8 public Docker image from Docker Hub * adding logs for indicating start and end of an iteration (#1357) * adding logs for indicating start and end of an iteration (#1356) * adding logs for indicating start and stop of an iteration (#1333) * Fixed version to avoid snyk vulnerability (#1363) * minor changes in the readme (#1365) * drop last and add readme (#1368) * Enable local batch size param for dlrm distribtued training (#1369) * Fixed Batch size param for transformer fp32 (#1367) * enable local 
batch size for distributed training (#1370) * Use local batch size for maskrcnn distributed training (#1371) * Enable local_batch_size for RNN-T distributed training (#1374) * update README for resnet50 and maskrcnn (#1373) * fix sq api (#1379) * add config argument (#1384) * Fix typo of DLRM script (#1375) Co-authored-by: jiayisunx <[email protected]> * Add AVX check logic to workload tests workflow * Disable schedule CI/CD execution * add command to install tfgnn from start.sh (#1382) * fix dlrm batch size issue (#1386) * Add more logs for vit training (#1354) * Remove saving the model * Add no. steps as configurable in quickstart script * Add start/stop logs * move throughput before evaluate * removed accuracy script in gha scripts for debug (#1388) * removed accuracy script for debug * removed accuracy script for debug * Transformers patch for keras nightly (#1387) * Minor changes to capture correct data * Added new patch * Better reporting * Add requirements file * Licence and samples (#1350) --------- Signed-off-by: Felipe Leza Alvarez <[email protected]> Co-authored-by: Leza Alvarez, Felipe <[email protected]> Co-authored-by: Gerardo Dominguez <[email protected]> Co-authored-by: Miguel Pineda <[email protected]> Co-authored-by: ma-pineda <[email protected]> Co-authored-by: aagalleg <[email protected]> Co-authored-by: gera-aldama <[email protected]> * set find_unused_parameters to true for DDP training (#1389) * Update README.md (#1390) * fix ssd rn34 ddp issue (#1391) * TF DistilBERT - Update model with new benchmark scripts (#1327) * Add separate benchmark script for Distilbert to use same input repeatedly. 
* Add start/stop logs for each iteration * Update unit tests for distilbert * Update distilbert script to select weight sharing option (#1392) * Sd benchmark use dummy dataset (#1398) * do not load data for benchmark test * change argument name * Ejan/3dunet accuracy fix (#1383) * Add SimpleITK version * Change simpleitk version * Change tables version * Clean-up models (#1404) * clean up models * fix unit tests * upgrade mlflows to fix CVEs (#1403) * fix not found links (#1405) * revert changes in CODEOWNERS file * remove torchrec_dlrm --------- Signed-off-by: Abolfazl Shahbazi <[email protected]> Signed-off-by: gera-aldama <[email protected]> Signed-off-by: Felipe Leza Alvarez <[email protected]> Signed-off-by: Felipe Leza Alvarez <[email protected]> Co-authored-by: Kanvi Khanna <[email protected]> Co-authored-by: akhilgoe <[email protected]> Co-authored-by: Syed Shahbaaz Ahmed <[email protected]> Co-authored-by: gera-aldama <[email protected]> Co-authored-by: Lu Teng <[email protected]> Co-authored-by: ellie-jan <[email protected]> Co-authored-by: ellie.jan <[email protected]> Co-authored-by: Shahbazi, Abolfazl <[email protected]> Co-authored-by: Jones, Dina S <[email protected]> Co-authored-by: Patil, Jitendra <[email protected]> Co-authored-by: Zhu, Wei2 <[email protected]> Co-authored-by: Sheng, Yang <[email protected]> Co-authored-by: Wang, Chuanqi <[email protected]> Co-authored-by: Liu, River <[email protected]> Co-authored-by: Robison, Clayne B <[email protected]> Co-authored-by: ltsai1 <[email protected]> Co-authored-by: Yimei Sun <[email protected]> Co-authored-by: Melanie H Buehler <[email protected]> Co-authored-by: Mahmoud Abuzaina <[email protected]> Co-authored-by: Rajendrakumar Chinnaiyan <[email protected]> Co-authored-by: Yerneni, Venkata P <[email protected]> Co-authored-by: Thakkar, Om <[email protected]> Co-authored-by: Ojha, Shweta <[email protected]> Co-authored-by: Cui, Xiaoming <[email protected]> Co-authored-by: Varghese, Jojimon <[email 
protected]> Co-authored-by: mdfaijul <[email protected]> Co-authored-by: Shiddibhavi, Sharada <[email protected]> Co-authored-by: Shah, Sharvil <[email protected]> Co-authored-by: Ketineni, Rama <[email protected]> Co-authored-by: nedsouza <[email protected]> Co-authored-by: Vincent Zhang <[email protected]> Co-authored-by: mahathis <[email protected]> Co-authored-by: sramakintel <[email protected]> Co-authored-by: Chao1Han <[email protected]> Co-authored-by: Tengfei, Han <[email protected]> Co-authored-by: Feng Yuan <[email protected]> Co-authored-by: Mahathi Vatsal <[email protected]> Co-authored-by: chaohan <[email protected]> Co-authored-by: jiayisunx <[email protected]> Co-authored-by: Vlad Silverman <[email protected]> Co-authored-by: YanbingJiang <[email protected]> Co-authored-by: leslie-fang-intel <[email protected]> Co-authored-by: WeizhuoZhang-intel <[email protected]> Co-authored-by: liangan1 <[email protected]> Co-authored-by: zhuhaozhe <[email protected]> Co-authored-by: Tyler Titsworth <[email protected]> Co-authored-by: Jing Xu <[email protected]> Co-authored-by: blzheng <[email protected]> Co-authored-by: jianan-gu <[email protected]> Co-authored-by: Neo Zhang Jianyu <[email protected]> Co-authored-by: XiaobingZhang <[email protected]> Co-authored-by: Srini511 <[email protected]> Co-authored-by: Sean-Michael Riesterer <[email protected]> Co-authored-by: Chunyuan WU <[email protected]> Co-authored-by: xiaofeij <[email protected]> Co-authored-by: Rahul Nair <[email protected]> Co-authored-by: Veena2207 <[email protected]> Co-authored-by: xiangdong <[email protected]> Co-authored-by: Huang, Zhiwei <[email protected]> Co-authored-by: Sharvil Shah <[email protected]> Co-authored-by: wyang2 <[email protected]> Co-authored-by: zofia <[email protected]> Co-authored-by: Cui, Yifeng <[email protected]> Co-authored-by: LuFengqing <[email protected]> Co-authored-by: Li, Guizi <[email protected]> Co-authored-by: Wang, Yanzhang <[email protected]> 
Co-authored-by: FengXiongIntel
Co-authored-by: xiaoman-liu
Co-authored-by: gaurides
Co-authored-by: ke1ding
Co-authored-by: Tyler Titsworth
Co-authored-by: ratnampa
Co-authored-by: Jesus Herrera Ledon
Co-authored-by: ke1ding
Co-authored-by: dhermosi
Co-authored-by: justkw
Co-authored-by: lerealno
Co-authored-by: hanchao
Co-authored-by: DiweiSun
Co-authored-by: Miguel Pineda
Co-authored-by: Gerardo Dominguez
Co-authored-by: Felipe Leza Alvarez
Co-authored-by: aagalleg
Co-authored-by: Leza Alvarez, Felipe
Co-authored-by: ma-pineda
Co-authored-by: Mustafa
Co-authored-by: Real Novo, Luis
Co-authored-by: sachinmuradi
Co-authored-by: Ashiq Imran
Co-authored-by: Cao E
* update driver version (#1429)
* p0 ipex rn50 ATS-M (#1426)
* add ipex stable diffusion
* change base image P0 ITEX rn50 (#1431)
* MaskRCNN ATS-M container (#1417)
* p0 ipex stable diffusion (#1424)
* yolov5 p0 ipex ATS-M (#1425)
* itex atsm stable diffusion (#1418)
* P0 ITEX Efficientnet B0,B3 (#1411)
* EOLing docker builder files for workload containers (#1437)
* removing dockerfiles directory
* removed docker builder spec, partials
* change precision to lowercase (#1456)
* Update IPEX cpu baremetal instructions (#1451)
* clean up ipex baremetal instructions
* update horovod version in docs (#1458)
* Remove all software.intel.com links (#1381)
* Corrected software.intel.com
* Removed dev catalog pages for EOL models
* Added and updated baremetal README for P0 GPU models (#1447)
* updated the GPU readme
* PYT SPR BERT Large (#1472)
* add avx-fp32
* Adapt newer BKC
* remove idsid
* update base image
* updated tpp files for 2.12.1 release (#1479)
* updated tpp files
* added yolo5
* another update to TPPs (#1503)
* resolve merge conflicts
* Bump mlflow in /datasets/cloud_data_connector/samples/interoperability (#1492)
  Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.5.0 to 2.6.0.
* Bump mlflow in /datasets/cloud_data_connector/samples/azure (#1491)
  Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.5.0 to 2.6.0.
* fix issues with resolving conflicts
* P0 models list (#1500)
* sync with r2.12.1

---------

Co-authored-by: mahathis
Co-authored-by: Srikanth Ramakrishna
Co-authored-by: Tyler Titsworth
Co-authored-by: Sharvil Shah
Co-authored-by: Jitendra Patil
---------

Signed-off-by: Felipe Leza Alvarez
Signed-off-by: Abolfazl Shahbazi
Signed-off-by: Felipe Leza Alvarez
Co-authored-by: YanbingJiang
Co-authored-by: jiayisunx
Co-authored-by: Real Novo, Luis
Co-authored-by: ma-pineda
Co-authored-by: jojivk-intel-nervana
Co-authored-by: xiaofeij
Co-authored-by: WeizhuoZhang-intel
Co-authored-by: jianan-gu
Co-authored-by: liangan1
Co-authored-by: leslie-fang-intel
Co-authored-by: zhuhaozhe
Co-authored-by: lerealno
Co-authored-by: gera-aldama
Co-authored-by: xiangdong
Co-authored-by: Jitendra Patil
Co-authored-by: Srikanth Ramakrishna
Co-authored-by: mahathis
Co-authored-by: Clayne Robison
Co-authored-by: akhilgoe
Co-authored-by: Jesus Herrera Ledon
Co-authored-by: Felipe Leza Alvarez
Co-authored-by: aagalleg
Co-authored-by: Gerardo Dominguez
Co-authored-by: Leza Alvarez, Felipe
Co-authored-by: Miguel Pineda
Co-authored-by: sachinmuradi
Co-authored-by: Abolfazl Shahbazi
Co-authored-by: Chunyuan WU
Co-authored-by: Om Thakkar
Co-authored-by: Ashiq Imran
Co-authored-by: Mahmoud Abuzaina
Co-authored-by: Kanvi Khanna
Co-authored-by: Cao E
Co-authored-by: Syed Shahbaaz Ahmed
Co-authored-by: ellie-jan
Co-authored-by: zofia
Co-authored-by: Lu Teng
Co-authored-by: Mao, Yunfei
Co-authored-by: yisonzhu
Co-authored-by: DiweiSun
Co-authored-by: zengxian
Co-authored-by: Mahathi Vatsal
Co-authored-by: Tyler Titsworth
Co-authored-by: okhleif-IL
Co-authored-by: Harsha Ramayanam
Co-authored-by: jianyizh
Co-authored-by: nhatle
Co-authored-by: Sharvil Shah
Co-authored-by: Gopi Krishna Jha
* Bump mlflow in /datasets/cloud_data_connector/samples/azure (#1698)
  Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.6.0 to 2.8.1.
  - [Release notes](https://github.com/mlflow/mlflow/releases)
  - [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.md)
  - [Commits](mlflow/mlflow@v2.6.0...v2.8.1)
  ---
  updated-dependencies:
  - dependency-name: mlflow
    dependency-type: direct:production
* Bump mlflow in /datasets/cloud_data_connector/samples/interoperability (#1697)
  Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.6.0 to 2.8.1.
  - [Release notes](https://github.com/mlflow/mlflow/releases)
  - [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.md)
  - [Commits](mlflow/mlflow@v2.6.0...v2.8.1)
  ---
  updated-dependencies:
  - dependency-name: mlflow
    dependency-type: direct:production
* update intel tf version to be the latest (#1748)
* Remove MLFlow dependency. Updates on functional tests (#1756)
* update mlflow version to use the latest (#1768)
* remove license headers from .txt files in data-connector (#1774)

---------

Co-authored-by: Jesus Herrera Ledon
* update dlrmv2 BKC (#1476) * use merge-emb-cat for int8 since acc issue is fixed in IPEX (#1477) * forcing numpy to use a specific version (#1474) * Restrict vit training to single socket (#1484) * update dlrm/bert distribute training BKC (#1486) * change batch size for int8 resnet50 and ssd-resnet34 (#1487) * Added command to remove existing logs in output_dir (#1475) * Added command to remove existing logs in output_dir * Fix condition to checkout OneAPI tools repository (#1490) * Hz/dlrm ddp (#1496) * fix dlrm ddp * fix time computation --------- Co-authored-by: Weizhuo Zhang <[email protected]> * fix dlrm-v1 int8 thp (#1497) * also use merged-emb-cat in dlrm-v2 int8 thp (#1498) * updated tpp files for 2.12.1 release (#1479) * updated tpp files * added yolo5 * P0 models list (#1500) * P0 models list * replace master w/ tag * correct framework name --------- Co-authored-by: Jitendra Patil <[email protected]> * another update to TPPs (#1503) * Fixing SSD-Resnet34 training quickstart script to run right number of instances (#1493) * Container GHA Pipeline Reformat (#1462) * swap runner to mlops runner * change ipex base image (#1440) * rename tests yaml (#1450) * New Test Runner (#1461) * add execute perms to quickstart and add 140 tests to pytorch resnet with new runner * add tests per new format * add flex140 support * Update Test Runner (#1467) * Flex 140 tests for P0 (#1469) * add previous m3 commits (#1478) * GHA tests for flex 140 (#1499) * add previous m3 commits (#1478) * Added command to remove existing logs in output_dir (#1475) * address PR review (#1501) * remove makefile * Remove caas reference (#1502) * Add previous m3 commits in baremetal readme (#1480) --------- Co-authored-by: Srikanth Ramakrishna <[email protected]> Co-authored-by: mahathis <[email protected]> * refine dlrm ddp dataloader (#1504) Co-authored-by: Weizhuo Zhang <[email protected]> * workaround oneccl bad termination issue for RN50 distributed training (#1508) * Fix Test Pipeline 
(#1514) * fix test pipeline * Update container-pipeline-tester.yml * Bump mlflow in /datasets/cloud_data_connector/samples/interoperability (#1492) Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.5.0 to 2.6.0. * Bump mlflow in /datasets/cloud_data_connector/samples/azure (#1491) Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.5.0 to 2.6.0. * Zufang/readme update for itex (#1485) * add link to int8 PB for onednn graph * refine readme for onednn graph option * MaskRCNN GPU training (#1513) * maskmrcnn model demo zero-bkc * update readme * added license header * added HW requirement * update docs * support bf32 for SD finetune (#1521) * fix dlrm-v1 ddp hang (#1512) * fix dlrm-v1 ddp hang * comment out two more barrier * modify multi-node scripts for resnet50, maskrcnn and stable diffusion (#1537) * modify multi-node scripts for RNNT and ssd-resnet34 (#1539) * Modify Test Runner Run Dir from GHA (#1541) * Adjust test runner path to be MLOps root * move to test-runner dir * merge dir and parent_dir * remove parent dir * use full paths * get test name for artifact upload * Stop workload tests and security scans on open PR event (#1543) * PVC P0 RN50 PYT Inference (#1494) * updated to latest BKC to add multi-card multi-tile support * PVC P0 PYT BERT Large (#1495) * Add support for multi-card multi-tile * PVC P0 PYT DLRM training (#1518) * add dlrm training pvc support * Fix LKG pipeline for the new ipex and itex conda installers (#1525) * update LKG pipeline for the new ipex and itex conda installers * remove version subdir from bom file * TF RN50V1_5 P0 Max Inference (#1529) Co-authored-by: Mahathi Vatsal <[email protected]> * modify SD inference scripts (#1553) * Updates + New Transfer Learning Notebooks from TLT team (#1522) * Update for all transfer learning notebooks --------- Co-authored-by: Harsha Ramayanam <[email protected]> * Rename the repo to Intel AI Reference Models for rebranding (#1473) * Rename the repo for rebranding, remove k8s and tools 
directories * Update README.md --------- Co-authored-by: Clayne Robison <[email protected]> * modify SD finetune scripts (#1555) * fix CrossEntropyLoss target for dummy inputs (#1556) * PVC P0 PYT DLRM inference (#1517) * add dlrm pvc inference support * Modified baremetal README(bert and rn50) for max series (#1520) * fix itex (#1527) * fix itex * fix key error * PVC BERT-Large P0 TF (#1534) * adapt new BKC * Modified Bert large TF baremetal readme for Mx series (#1557) --------- Co-authored-by: Mahathi Vatsal <[email protected]> * Add optimizations for BF16 transformer inference (#1489) * Change the order of operations from dense->concat->split_heads to dense->split_heads->concat (attention_layer.py) * Change the order of operations when calculating encoder-decoder k, v caches to avoid using matmul ops between large matrices(transformer.py) *Reduce number of occurrences of split_heads with encoder-decoder k, v caches by performing split_heads in transformer.py * Update README.md (#1516) Change BFloat16 model description to point to new frozen graph. 
* make llama training max-step flexible (#1563) * Fix drops logic to be avoided if any workload test fails (#1564) * Create selective PR validations tags-based (#1549) * Create selective PR validations tags-based * Add edit_pull_request as trigger for the PR validations --------- Co-authored-by: Wafaa Taie <[email protected]> * PYT CPU Automation (#1544) * make initial changes * add tests for new base container * add more new tests * remove env var * add more tests * more test added * add dlrm inference build and test * add more tests * add more tests * add another model test * add final tests * add devcatalog (#1566) * Change success condition from previous jobs before doing drop (#1568) * Added readme for DLRM pytorch MAX series (#1570) * update docs for AI Tools (#1567) * ViT Train : Enable multi instance training for Tensorflow Vision Transformer model (#1569) * Revert "Restrict vit training to single socket (#1484)" This reverts commit ae2c801. * Add multi instance support * Update README.md * Remove useless files and add license title for DLRM v2 (#1574) * Gda/step url (#1560) * Added step url to result table * Remove continue on error from workload tests * Run performance checks even when a workload test failed * add driver setup doc (#1571) Co-authored-by: Jitendra Patil <[email protected]> * LLM models using ipex.optimize_transformers for bf16/int8 (#1562) * init pr * revise v1, local test passed * TF ResNet50v1.5 Fix (#1575) * tf cpu r50 inf fixed * actions.json added * newline added for actions * venv instllation added * pip install fixed * venv pip install fixed * test workflow reverted back --------- Co-authored-by: Jitendra Patil <[email protected]> * added int8 support for graphsage (#1536) * MEMREC DLRM Inference (#1547) * Added omp_num_threads and cores_per_instance as env variable (#1573) * esnet50v1.5 and bert large inference models * TF Stable Diffusion: Download model files in start.sh and log latency & throughput (#1581) * adding model 
download to start.sh to avoid failure during multi-instance execution * downloading the clip tokenizer in start.sh also * script changes to report both latency and throughput * Doc fixes (#1582) * add torchrec dlrm to the models table * fix paths to run quickstart scripts * Add RN50 INT8 Calibration file (#1545) * Jupyter notebook for AI Reference models (#1583) * Added jupyter notebook for AI Reference models * Added README for AI_Reference jupyter notebook * Supports Resnet50 v1.5 and mobilenet v1 inference workloads * Upgrade Pillow version to 10.0.1 to fix high severity CVEs (#1584) * remove workflows * correct_release_tag (#1587) * correct_release_tag * revert a change * Unexpose old TF models (#1593) * tf cpu distilbert inf (#1612) * 3D Unet MLPerf Inference Workload (#1595) * 3D Unet MLPerf added * docker compose added * batch size corrected * numactl added * ubuntu Dockerfile updated * output dir changed in tests * yes flag added in Dockerfile * default OS added * BERT Large Inference CPU Workload Added (#1594) * BERT Large Inf added * TCMALLOC added * ubuntu Dockerfile updated * TCMalloc location updated * test file updated * yes flag added in Dockerfile * TF CPU Bert Large Training Workload (#1596) * bert large pretraining added * extra OS removed from r50 inf service * ubuntu Dockerfile updated * ssh helper script added * yum non tineractive update * output dir fixed * TF CPU DIEN Inference Workload (#1598) * TF CPU MobileNet V1 Inference Workload (#1600) * TF CPU ResNet v1.5 Training Workload (#1604) * TF CPU SSD ResNet-34 Inference Workload (#1606) * TF CPU SSD ResNet-34 Training Workload (#1607) * TF CPU Transformer MLPerf Inference Workload (#1597) * TF CPU Transformer MLPerf Training (#1608) * TF CPU DistilBERT fixed (#1629) * TF CPU SSD MobileNet Inference Workload (#1605) * Fixed typo in readme for framework (#1631) * Checkpoints added for TF CPU Workloads (#1637) * TF CPU Dev Catalog READMEs Updated. 
(#1652) * EMR PYT RN50 Infer (#1624) --------- Co-authored-by: Jitendra Patil <[email protected]> * EMR PYT RN50 Train (#1625) * build rn50 train centos * comment conda lines * comment conda lines * remove fp16 test,add devcat and intel-openmp * add changes to ubuntu * remove commented lines * add cpu to tag * EMR PYT ResNext Infer (#1619) * add initial commits for emr resnext * add dockerfiles * build resnext * remove extra precisions * add devcat,openmp and more tests * add cpu to tag * EMR PYT MaskRCNN Infer (#1621) * add maskrcnn inference * add cmake and bf32 tests * add openmpi,more tests and devcat * add cpu to tag * rearrange pip installs * EMR PYT MaskRCNN Train (#1622) * build maskrcnn training * add cmake * correct image names * add ld_predload,devcat and tests changes * EMR PYT SSD-ResNet34 Train (#1618) * add compose changes * ssd-resnet34 build * correct tests file * add more tests and devcat * EMR PYT SSD-ResNet34 Infer (#1617) * build ssd-resnet34 images * add bf32 tests and update DEVCATALOG.md * rename devcatalogs (#1671) * EMR PYT BERT Large Infer (#1623) * add bert-large build * correct paths * add devcatalog * add more tests * Rename EMR_DEVCATALOG.md to DEVCATALOG.md * Update DEVCATALOG.md * PYT EMR BERT-Large Train (#1639) * build bert-large training * add pretrained model env * remove idsid * add more tests and devcatalog * correct env and rename * Delete EMR_DEVCATALOG.md * Update DEVCATALOG.md * EMR PYT Distilbert Infer (#1620) * build distilbert images * validate distilbert * add more tests and devcatalog * remove MZ reference * modify env params * uncomment and remove idsid * clarify core per instance * clarify hf_datasets * remove void env var * Rename EMR_DEVCATALOG.md to DEVCATALOG.md * Update DEVCATALOG.md * EMR PYT RNNT Inference (#1616) * add dockerfiles for rnnt * fix pytorch binding error * copy diff file to inference * add librosa * add more tests and devcatalog * correct reatime cmd * Rename EMR_DEVCATALOG.md to DEVCATALOG.md * 
Update DEVCATALOG.md * EMR PYT RNNT Train (#1615) * Rename EMR_DEVCATALOG.md to DEVCATALOG.md * Update DEVCATALOG.md * EMR PYT DLRM Infer (#1626) * Update DEVCATALOG.md * EMR PYT DLRM Train (#1628) * build dlrm training * add num_batch * add tcmalloc * add tcmalloc * add devcatalog * re-locate the file * Rename EMR_DEVCATALOG.md to DEVCATALOG.md * Update DEVCATALOG.md * make batch flexible (#1635) * Change dataset for Transfer Learning LLM Notebook (#1576) * update llm notebook with code alpaca * push updates * fixed broken link * Refactor Transfer Learning Notebook folder to match TLT structure + small diff (#1670) * refactor to match TL structure + small diff * fixed table structure * add landing page doc (#1653) * add landing page doc * simplify and add r3.1 * add precisions * add tf landing page * add precisions --------- Co-authored-by: Jitendra Patil <[email protected]> * r3.1 fixes (#1679) * TF CPU ResNet 50 v1.5 Inference Model Checkpoints fixed (#1663) * spr removed from workdir * R50 Inf fixed * fixed rn50 error * fixed in docker compose yaml --------- Co-authored-by: Sharvil Shah <[email protected]> * make minor corrections in devcatalog README (#1680) * Remove old TF models (#1673) * remove ResNet50, FasterCNN, RFCN, NCF, Wide and deep Large dataset training, waveNet, Inception v4, mlperf GNMT models * remove relevant unit tests and update coverage precentage * refine changes based on feedback (#1684) * fix typo (#1686) * fixing docker iamages names * fixing TF centos docker images link * unset KMP AFFINITY for accuracy scripts (#1689) * Bump mlflow in /datasets/cloud_data_connector/samples/azure (#1698) Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.6.0 to 2.8.1. 
- [Release notes](https://github.com/mlflow/mlflow/releases) - [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.md) - [Commits](mlflow/mlflow@v2.6.0...v2.8.1) --- updated-dependencies: - dependency-name: mlflow dependency-type: direct:production * Bump mlflow in /datasets/cloud_data_connector/samples/interoperability (#1697) Bumps [mlflow](https://github.com/mlflow/mlflow) from 2.6.0 to 2.8.1. - [Release notes](https://github.com/mlflow/mlflow/releases) - [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.md) - [Commits](mlflow/mlflow@v2.6.0...v2.8.1) --- updated-dependencies: - dependency-name: mlflow dependency-type: direct:production * add optional args to devcatalog pages (#1750) * add optional env for tf * add optional args * add v2 to table (#1762) * remove space * Corrected updating OMP NUM THREADS (#1759) * validate omp_num_threads and cores_per_instance * revert changes * validate omp num threads and cores per instance (#1789) * add omp_num_threads and cores_per_instance (#1809) * wsl2 documentation (#1815) * add wsl2 stable diffusion documentation * add wsl2 stable diffusion documentation * add wsl2 base doc * make minor tabular changes * make minor tabular changes * add ssh instructions * re-word example * update torch version outputs * add entry in main readme --------- Co-authored-by: Jitendra Patil <[email protected]> * fix dlrm normal training to support SNC mode (#1699) Co-authored-by: Weizhuo Zhang <[email protected]> * update torch-ccl branch (#1716) * Pytorch ResNext32x16d baremetal EMR tests (#1640) * Pytorch RN50 baremetal EMR inference and training tests (#1641) * PyTorch SSD-Resnet34 EMR baremetal training and inference tests (#1647) * PyTorch distilBERT baremetal tests (#1650) * PyTorch BERT_LARGE_SQUAD inf baremetal tests (#1651) * PyTorch BERT_LARGE Training baremetal tests (#1654) * Pytorch MaskRCNN baremetal EMR tests (#1646) * PyTorch DLRM baremetal tests (#1649) * PyTorch RNN-T EMR baremetal tests 
(#1648) * remove TF yolo v5, add cpu stable diffusion * [Zero-BKC][ITEX][GPU]add itex stable diffusion,EfficientNet,wide and deep inference for ats-m (#1642) Co-authored-by: XumingGai <[email protected]> * remove pb files from the github repo (#1730) * modify rn50 training script (#1732) * Refactor new zero bkcs scripts for TF ResNet50 inf and Mask-RCNN to models_v2 (#1695) * move new zero bkcs for TF resnet50 inf and maskrcnn to models_v2 * GHA tests for dGPU zero copy BKC format workloads (#1744) --------- Co-authored-by: Mahathi Vatsal <[email protected]> * modify Stable Diffusion finetune script (#1746) * add utilities to parse result for pytorch (#1729) * Wliao2/add rn50 (#1693) * add dgpu resnet 50 * update intel tf version to be the latest (#1748) * Enable Inductor path for Bert_large inference and training (#1733) * Init Bert-large files from inductor path * cherry pick Enable int8-mixed-bf16 for 5 transformer models (#1720) * modify README * Enable Inductor path for Distilbert-base inference (#1734) * Init Bert-large files from inductor path * cherry pick Enable int8-mixed-bf16 for 5 transformer models (#1720) * modify README * Init Distilbert base models * Init DLRM Script (#1739) * Enable Inductor path for RN50 inference and training (#1718) * Enable Inductor path for RN50 inference and training * add bf32 * add README for Torch inductor --------- Co-authored-by: leslie-fang-intel <[email protected]> * Move and add license headers P1 ATS-M ITEX models (#1754) * move and add license headers to stable diffusion model * move and add license headers to efficientnet * update maskrcnn inference * move and update license headers for wide and deep model * Remove MLFlow dependency. 
Updates on functional tests (#1756) * add v2 to table (#1762) * Weizhuoz/fix bert accuracy (#1761) * fix bert-large accuracy read issue * fix bert_large accuracy issue * inductor int8 could not use model.eval() * remove ssdmobilenet and yolov4 (#1757) * update maskrcnn, bert-large training (#1666) * update maskrcnn training * update maskrcnn and add bert-large * move maskrcnn training to model_v2, update license * code review changes for bert large training --------- Co-authored-by: Wafaa Taie <[email protected]> * update mlflow version to use the latest (#1768) * [Zero-BKC][ITEX][GPU] Add resnet50 and 3dd-unet training (#1664) * add 3d-unet gpu training * add gpu resnet50 training * update license headers and move scripts to models_v2 * code review changes for 3d-unet training, update license, move to models_v2 * changes in readme for code review --------- Co-authored-by: Wafaa Taie <[email protected]> * enable distributed training for DLRMv2 and some fix for inductor path (#1770) * enable distributed training for DLRMv2 and some fix for inductor path * add missing files * remove license headers from .txt files in data-connector (#1774) * fix data loader (#1780) * Fix for TL Notebooks GHA (#1773) * fixed file paths * changed venv creation * trying venv * trying no venv * reverted venv3 * testing apt get update * added apt-get install * venv3 --> venv * uncommented apt * without virtualenv * added pip install virtualenv * downgraded PyYaml to be conpatible with tf models official * 2.12.0 --> 2.12.1 * removed package versions * added back tf official version * flipped * addressed review comments * Fix inductor path int8 bf16 realtime issue (#1767) * Update inference_performance.sh (#1784) * fix acc (#1798) * add warm up iter for inference throughput (#1799) * Predownload weights for SDv2.1 (#1797) * predownload weights for SDv2.1 * update hash * fix DLRM V1 syntax error (#1760) * fix DLRM V1 syntax error * fix dlrm inductor numpy.bool_ error (#1448) * fix DLRM V1 
train_ld issue * Update dlrm_s_pytorch.py --------- Co-authored-by: Chunyuan WU <[email protected]> Co-authored-by: zhuhaozhe <[email protected]> * Fix bert large inductor int8 accuray failure in last batch (#1803) * fix bert large inductor int8 accuracy issue * Format fixes --------- Co-authored-by: jianan-gu <[email protected]> * add bert (#1701) * add bert IPEX Co-authored-by: Wafaa Taie <[email protected]> Co-authored-by: Mahathi Vatsal <[email protected]> * Wliao2/add dlrm kaggle (#1711) * add dlrm kaggle * fix license issue * Update README.md * remove unused files * update * add licence header for modified files --------- Co-authored-by: Mahathi Vatsal <[email protected]> * update bert-large for ARC (#1816) * update scripts and Readme for ARC * Update README.md * Update quickstart scripts with env variables for Stable Diffusion and ResNet50v1.5 (#1791) * adding env vars for SD and RN50 * updating accuracy quickstart and other minor changes * using default values only if the env var are not set from the cmd line * fix for coverage tests * Added GHA for ITEX wide deep large (#1820) * Added GHA for ITEX wide deep large * Added stable diffusion inference ITEX (#1819) * Added stable diffusion inference * Changed file permissions * Added GHA for EfficientNet ITEX (#1821) * Added GHA for EfficientNet ITEX * Update run_test.sh * Update setup.sh * Update README.md * Added GHA for ITEX bert large Training (#1822) * ssdmobilenet int8 accuracy fix (#1811) * ssdmobilenet int8 accuracy fix * added change in quickstart accuracy script * modified public bucket link * modified new args in unit test * fixed unit test * add BKC for DLRM-V2 convergence test (#1824) * wsl2 documentation (#1815) * add wsl2 stable diffusion documentation * add wsl2 stable diffusion documentation * add wsl2 base doc * make minor tabular changes * make minor tabular changes * add ssh instructions * re-word example * update torch version outputs * add entry in main readme --------- Co-authored-by: 
Jitendra Patil <[email protected]> * bert-large inductor uses int8_bf16 mix (#1792) * Bert-large inductor use int8-bf16 mix * ipex uses int8_bf16 mix in if * merge develop * inductor uses int8-bf16 mix * Distilbert int8 optimization (#1830) * optimize distilbert int8 * re-calibrate distilbert * fix for inductor (#1834) * TF- DistilBERT - Update quickstart scripts with env vars (#1818) * Add Env variables to quickstart scripts * Update # of cores for throughput script * add distilbert (#1702) * add distilbert * Corrected refactored path for distilbert * Added intel license header * Update README.md --------- Co-authored-by: Mahathi Vatsal <[email protected]> * Update dlrm_s_pytorch.py (#1843) * Wliao2/add stable diffusion (#1705) * add stable_diffusion * update some typo * fix license issue * update stable diffusion * update for acc * verify the result * Refactored to new folder * Update README.md * add support for ARC --------- Co-authored-by: Mahathi Vatsal <[email protected]> * Fix Bert Large Int8 Latency Issue (#1859) Co-authored-by: jianan-gu <[email protected]> * [DistilBert] modify for masked_fill default value (#1868) * Nhatle/bert large training x3 vs x1 (#1776) * set num-inter-threads=2 * bert-large squad: Binding process to cores on 1 socket * Enable multi-instance training for bert-large squad * Fix incase users only run 1 instance * Fix benchmark_command * Molly/ddp bkc update (#1873) * make num_iter flexbile * bugfix for bert-large ddp * bkc for rn50 ddp training update * bkc for rn50 ddp training update * bkc for dlrm_v1 ddp training update --------- Co-authored-by: WeizhuoZhang-intel <[email protected]> * Corrected IPEX installer version (#1878) * Changed IPEX installer version * Update dlrm_s_pytorch.py (#1879) * Update AI Bundle version in tests setup files for CI/CD pipeline (#1881) * doc: document models_v2 contribution guideline (#1855) Signed-off-by: Dmitry Rogozhkin <[email protected]> * Molly/inductor fp16 (#1875) * make num_iter flexbile * 
bugfix for bert-large ddp * bkc for rn50 ddp training update * bkc for rn50 ddp training update * bkc for dlrm_v1 ddp training update * rn50 fp16 torch.compile enabled * fp16 autocast fix * fix RN50 for fp16 torch.compile (#1849) * enable stable-diffusion fp16 inductor path * vit, bert-large fp16 enable * merge to latest transformers patch * Update enable_ipex_for_transformers.diff * Update enable_ipex_for_transformers.diff --------- Co-authored-by: Cao E <[email protected]> Co-authored-by: WeizhuoZhang-intel <[email protected]> * Fix in case mpi_num_processes_per_socket=1 (#1885) * Fix in case mpi_num_processes_per_socket=1 * small fix * Update dlrm_s_pytorch.py (#1890) * Modified dataset path (#1894) * added GHA for ITEX bert large * Stable Diffusion PYT Flex and Max (#1853) * validate sd pyt * add max tests and dockerfile * Bump scipy in /models_v2/pytorch/stable_diffusion/inference/gpu (#1847) Bumps [scipy](https://github.com/scipy/scipy) from 1.9.1 to 1.11.1. - [Release notes](https://github.com/scipy/scipy/releases) - [Commits](scipy/scipy@v1.9.1...v1.11.1) --- updated-dependencies: - dependency-name: scipy dependency-type: direct:production * Bump gitpython in /models_v2/pytorch/distilbert/inference/gpu (#1846) Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.30 to 3.1.41. - [Release notes](https://github.com/gitpython-developers/GitPython/releases) - [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES) - [Commits](gitpython-developers/GitPython@3.1.30...3.1.41) --- updated-dependencies: - dependency-name: gitpython dependency-type: direct:production * Bump transformers in /models_v2/pytorch/distilbert/inference/gpu (#1840) Bumps [transformers](https://github.com/huggingface/transformers) from 4.25.1 to 4.36.0. 
- [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](huggingface/transformers@v4.25.1...v4.36.0) --- updated-dependencies: - dependency-name: transformers dependency-type: direct:production * Bump transformers in /models_v2/pytorch/bert_large/inference/gpu (#1810) Bumps [transformers](https://github.com/huggingface/transformers) from 4.11.0 to 4.36.0. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](huggingface/transformers@v4.11.0...v4.36.0) --- updated-dependencies: - dependency-name: transformers dependency-type: direct:production * Bump transformers (#1786) Bumps [transformers](https://github.com/huggingface/transformers) from 4.30.0 to 4.36.0. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](huggingface/transformers@v4.30.0...v4.36.0) --- updated-dependencies: - dependency-name: transformers dependency-type: direct:production * Bump transformers (#1785) Bumps [transformers](https://github.com/huggingface/transformers) from 4.30.0 to 4.36.0. 
- [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](huggingface/transformers@v4.30.0...v4.36.0) --- updated-dependencies: - dependency-name: transformers dependency-type: direct:production * Added GHA tests for stable diffusion (#1887) * validate sd pyt * update for ARC (#1781) * update for ARC * update log * refactor to models_v2 * update path due to refactor * sync with 2.1rc3 * Wliao2/add dlrm (#1704) * add dlrm v2 * fix license issue * Update dlrm_dataloader.py * Update dist_models.py * Update dlrm_dataloader.py * Update dist_models.py * refactored to a new folder * update Readme --------- Co-authored-by: Mahathi Vatsal <[email protected]> * Max 3D-Unet container support (#1832) * add masrkcnn container support * add 3d-unet container support * Added GHA for 3d Unet Training ITEX (#1823) * Added Resnet50v1.5 and maskrcnn train GHA test (#1751) * Refactored resnet50v1_5 for Zero copy BKC format (#1897) * Added Necessary Metadata and Bug Fixes for Transfer Learning Notebooks (#1825) * fixed file paths * changed venv creation * added pip install virtualenv * downgraded PyYaml to be conpatible with tf models official * 2.12.0 --> 2.12.1 * added back tf official version * addressed review comments * fixed version for fsspec, removed llm test * changed accelerate version * fix for tf-models-official * added needed metadata * made tests significantly less expensive * fixed zip extract to tar extract * fixed sms download * fixed typo for csv path name * addressed review comments * simplified if statement * Added oneapi path (#1902) * validate sd pyt * Update README.md with ipex version (#1903) * Update README.md with ipex version * Max MaskRCNN container support (#1831) * add masrkcnn container support * Max RN50 container validation (#1829) * validate container for zero-bkc for rn50 max container * Max BERT-Large container support (#1833) * add bert-large container support * Flex Wide and deep container (#1851) * validate zero-copy bkc 
for itex stable diffusion * validate zero-copy bkc for flex container (#1772) * validate zero-copy bkc * EfficientNet Container for flex (#1771) * validate zero-copy bkc efficientnet * TF MaskRCNN container for Flex GPU (#1755) * adapt zero-copy bkc and validate maskrcnn * validate bert-large inference PYT PVC (#1841) * validate bert-large inference * validate bert-large container PVC pytorch (#1838) * validate bert-large container * RN50 PYT Max container (#1904) * validate refactor of zero-bkc training * Latest updates to TF RN50 for Flex series (#1813) * adapt zero-copy bkc for image build * Update README.md for IPEX versions (#1907) * Update README.md * not cast and ramdomrized crossnet bias for inductor and make warmup iters as an arg (#1906) * resolve merge conflicts (#1911) * Wliao2/add ssdmbv1 (#1817) * add ssd-mobilenetv1 Co-authored-by: Mahathi Vatsal <[email protected]> * add IPEX Max 3dunet (#1706) * add 3dunet for IPEX for 3DUnet for Max series * Updated baremetal readme for distil bert IPEX (#1908) * Updated baremetal readme distil bert for IPEX * DistilBERT inference container PYT Flex and Max (#1854) * add functional support * docker: fix broken docker-compose.yml (#1913) Fixes: 473d3b3 ("DistilBERT inference container PYT Flex and Max (#1854)") Signed-off-by: Dmitry Rogozhkin <[email protected]> * docker/flex: fix build and run for tf maskrcnn (#1896) Signed-off-by: Dmitry Rogozhkin <[email protected]> * Flex PYT DLRM-v1 inference (#1895) * build dlrmv1 container * Updated baremetal readme for DLRM v1 (#1909) * Updated baremetal readme for DLRM v1 * Update README.md (#1914) * remove extra test (#1916) * release docs for containers (#1915) * Update release container table * Updated main README table (#1919) * Updated main README table * clean up workflows * restore git submodules --------- Signed-off-by: Dmitry Rogozhkin <[email protected]> Co-authored-by: jianan-gu <[email protected]> Co-authored-by: zhuhaozhe <[email protected]> Co-authored-by: Om 
Thakkar <[email protected]> Co-authored-by: sachinmuradi <[email protected]> Co-authored-by: Cao E <[email protected]> Co-authored-by: mahathis <[email protected]> Co-authored-by: lerealno <[email protected]> Co-authored-by: DiweiSun <[email protected]> Co-authored-by: zengxian <[email protected]> Co-authored-by: Weizhuo Zhang <[email protected]> Co-authored-by: Jitendra Patil <[email protected]> Co-authored-by: Srikanth Ramakrishna <[email protected]> Co-authored-by: Mahmoud Abuzaina <[email protected]> Co-authored-by: Tyler Titsworth <[email protected]> Co-authored-by: jiayisunx <[email protected]> Co-authored-by: zofia <[email protected]> Co-authored-by: Mahathi Vatsal <[email protected]> Co-authored-by: okhleif-IL <[email protected]> Co-authored-by: Harsha Ramayanam <[email protected]> Co-authored-by: Clayne Robison <[email protected]> Co-authored-by: jianyizh <[email protected]> Co-authored-by: nhatle <[email protected]> Co-authored-by: gera-aldama <[email protected]> Co-authored-by: Real Novo, Luis <[email protected]> Co-authored-by: Sharvil Shah <[email protected]> Co-authored-by: Ashiq Imran <[email protected]> Co-authored-by: Gopi Krishna Jha <[email protected]> Co-authored-by: leslie-fang-intel <[email protected]> Co-authored-by: Sharvil Shah <[email protected]> Co-authored-by: Nick Camarena <[email protected]> Co-authored-by: xiangdong <[email protected]> Co-authored-by: wenjun liu <[email protected]> Co-authored-by: XumingGai <[email protected]> Co-authored-by: wincent8 <[email protected]> Co-authored-by: Jesus Herrera Ledon <[email protected]> Co-authored-by: XumingGai <[email protected]> Co-authored-by: Chunyuan WU <[email protected]> Co-authored-by: Syed Shahbaaz Ahmed <[email protected]> Co-authored-by: Xuan Liao <[email protected]> Co-authored-by: Dmitry Rogozhkin <[email protected]>
* fix typos * Remove unsupported precision
remove sklearn
Forcing merge. This is unpublished code. I'll do a proper PR against `develop`
Co-authored-by: Jitendra Patil <[email protected]>
* Fixed typos * Changed case for Arc
* add rnnt Co-authored-by: Mahathi Vatsal <[email protected]> Co-authored-by: Srikanth Ramakrishna <[email protected]>
* Remove FP16 scope * Enable RN50 inference with XLA
* modify maskrcnn inference scripts
* ipex/efficientnet: Fix no AMP CUDA path - No AMP path for CUDA in warmup section was missing cast to correct dtype. Adding this back resolves the issue. Signed-off-by: Voas, Tanner <[email protected]> * ipex/efficientnet: Fix typos and innacurate statements - Typos and unchecked parameter in run_model.sh - Typos in README.md Signed-off-by: Voas, Tanner <[email protected]> * ipex/efficientnet: Correct warmup and synchronization process - We were doing warmup but not returning the warmed up model - We should synchronize after each batch Signed-off-by: Voas, Tanner <[email protected]> * ipex/efficientnet: rework summary format - Dima: This commit reworks summary format for ipex/efficientnet sample according to designed schema - Tanner: Minor cleanup and typo fixes Signed-off-by: Dmitry Rogozhkin <[email protected]> Signed-off-by: Voas, Tanner <[email protected]> --------- Signed-off-by: Voas, Tanner <[email protected]> Signed-off-by: Dmitry Rogozhkin <[email protected]> Co-authored-by: Dmitry Rogozhkin <[email protected]>
* Fix names and remove scripts
* Updated baremetal DLRMv2 readme Co-authored-by: Srikanth Ramakrishna <[email protected]>
lerealno requested review from ashahba, claynerobison, jitendra42, lerealno and Mahathi-Vatsal as code owners on August 2, 2024 21:58
Inference DLRMv2 on CPU using dlrm_main.py
Issue:
This PR fixes that issue.
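The PR title ("Add return of construct_model in dlrm_main.py") suggests that `construct_model` built a model without returning it. Below is a minimal sketch of that class of bug. It assumes nothing about the real code: `TinyModel`, `construct_model_without_return`, and `construct_model_with_return` are hypothetical stand-ins, not the functions in `dlrm_main.py`.

```python
# Hypothetical stand-ins for illustration only; this is not the dlrm_main.py code.
class TinyModel:
    """A toy 'model' that scores each row by summing its features."""

    def __init__(self, num_features):
        self.num_features = num_features

    def __call__(self, batch):
        return [sum(row) for row in batch]


def construct_model_without_return(num_features):
    model = TinyModel(num_features)
    # No return statement: Python implicitly returns None,
    # so the caller never receives the model built above.


def construct_model_with_return(num_features):
    model = TinyModel(num_features)
    return model  # the caller now gets the constructed model


broken = construct_model_without_return(4)
fixed = construct_model_with_return(4)

# Calling `broken(...)` would raise TypeError because `broken` is None.
scores = fixed([[1, 2, 3, 4], [0, 0, 1, 1]])
```

In the sketch, `broken` is `None`, so any later inference call on it fails, while `fixed` is a usable object. Whether the actual failure in `dlrm_main.py` takes this form is an assumption based on the PR title alone.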